This document presents a statistical analysis of the Users table. The table has 79 variables (columns) and 18688 records, and it contains demographic and partial usage data for all users. The analysis uses the following libraries:
The column names are:
Index(['id', 'email', 'encrypted_password', 'reset_password_token',
'reset_password_sent_at', 'remember_created_at', 'created_at',
'updated_at', 'gender', 'date_of_birth', 'height', 'weight',
'activity_level', 'goal', 'body_type', 'body_fat',
'newsletter_subscription', 'is_admin', 'names', 'last_name',
'sign_in_count', 'current_sign_in_at', 'last_sign_in_at',
'current_sign_in_ip', 'last_sign_in_ip', 'recover_password_code',
'recover_password_attempts', 'facebook_uid',
'workout_setting_voice_coach', 'workout_setting_sound',
'workout_setting_vibration', 'workout_setting_mobility',
'workout_setting_cardio_warmup', 'workout_setting_countdown',
'notifications_setting', 'training_days_setting', 'google_uid',
'language', 'country', 'points', 'scientific_data_usage', 't1_push',
't1_core', 't1_legs', 't1_full', 't1_push_exercise', 't1_pull_up',
't2_reps', 't2_steps', 't2_reps_push', 't2_reps_core', 't2_reps_legs',
't2_reps_full', 't2_time_push', 't2_time_core', 't2_time_legs',
't2_time_full', 't1_full_exercise', 't1_pull_up_exercise',
'warmup_setting', 'warmup_session_id', 'stripe_id', 'provider', 'uid',
'best_weekly_streak', 'current_weekly_streak', 'affiliate_code',
'affiliate_code_signup', 'total_sessions', 'total_time',
'kcal_per_session', 'reps_per_session', 'moengage_id', 'mix_panel_id',
'apple_id_token', 'imported', 'platform', 'login_token',
'login_token_generated_at'],
dtype='object')
Several columns are omitted from the analysis because they hold sensitive data or are irrelevant.
The columns that will be analyzed are:
Variable types were converted to suitable ones. In most categorical variables, the numeric codes were also replaced with string labels. A summary of NULL counts and data types is given below.
In total, there are 18688 observations.
<class 'pandas.core.frame.DataFrame'>, RangeIndex: 18688 entries, 0 to 18687. Data columns (total 27 columns):

| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | id | 18688 | category |
| 1 | created_at | 18688 | datetime64[ns] |
| 2 | updated_at | 18688 | datetime64[ns] |
| 3 | gender | 18688 | category |
| 4 | date_of_birth | 18688 | datetime64[ns] |
| 5 | height | 18688 | float64 |
| 6 | weight | 18688 | float64 |
| 7 | activity_level | 18688 | category |
| 8 | goal | 18688 | category |
| 9 | body_type | 18688 | category |
| 10 | body_fat | 18688 | float64 |
| 11 | newsletter_subscription | 18688 | bool |
| 12 | notifications_setting | 18688 | bool |
| 13 | training_days_setting | 18688 | bool |
| 14 | language | 18688 | category |
| 15 | country | 6352 | category |
| 16 | points | 18688 | int64 |
| 17 | scientific_data_usage | 18688 | bool |
| 18 | best_weekly_streak | 18688 | int64 |
| 19 | affiliate_code_signup | 867 | category |
| 20 | total_sessions | 3640 | float64 |
| 21 | total_time | 3640 | float64 |
| 22 | kcal_per_session | 3640 | float64 |
| 23 | reps_per_session | 3640 | float64 |
| 24 | height[m] | 18688 | float64 |
| 25 | BMI | 18522 | float64 |
| 26 | BMI_category | 18522 | category |

dtypes: bool(4), category(9), datetime64[ns](3), float64(9), int64(2); memory usage: 2.9 MB
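The type conversion and the derived `height[m]`, `BMI` and `BMI_category` columns could be produced along the following lines. This is only a minimal sketch, not the notebook's actual code: the toy values, the integer coding of `gender`, and the BMI cut-offs (the commonly used 18.5 / 25 / 30 boundaries) are assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Users table (values are made up).
users = pd.DataFrame({
    "gender": [0, 1, 1],
    "height": [170.0, 180.0, 0.0],   # cm; 0 is an invalid entry
    "weight": [65.0, 90.0, 70.0],    # kg
    "created_at": ["2021-01-05", "2021-02-10", "2021-03-15"],
})

# Assumed coding of the categorical variable; the real mapping may differ.
users["gender"] = users["gender"].map({0: "female", 1: "male"}).astype("category")
users["created_at"] = pd.to_datetime(users["created_at"])

# Height in metres, then BMI = weight / height^2.
users["height[m]"] = users["height"] / 100
bmi = users["weight"] / users["height[m]"] ** 2
# A zero height gives an infinite BMI; store it as missing instead.
users["BMI"] = bmi.replace([np.inf, -np.inf], np.nan)

# Assumed cut-offs: <18.5, [18.5, 25), [25, 30), >=30.
users["BMI_category"] = pd.cut(
    users["BMI"],
    bins=[0, 18.5, 25, 30, np.inf],
    labels=["Underweight", "Normal", "Overweight", "Obesity"],
    right=False,
)
print(users.dtypes)
```

Under these assumptions a row with an invalid height ends up with a NULL `BMI` and `BMI_category`, which would be one way to arrive at fewer non-null BMI values than rows.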
The data was split into numerical and categorical/boolean data.
The variables taken as numerical data are:
A table with summary statistics (mean, standard deviation, minimum, maximum, quartiles, variance, skewness, kurtosis and NULL count) is given below.
| | count | mean | std | min | 25% | 50% | 75% | max | var | skewness | kurtosis | NULL count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| height | 18688.00 | 169.67 | 23.09 | 0.00 | 164.00 | 171.00 | 178.00 | 1780.00 | 533.18 | 19.35 | 1475.26 | 0 |
| weight | 18688.00 | 73.16 | 15.84 | 22.00 | 62.00 | 72.00 | 82.00 | 277.00 | 250.79 | 1.33 | 7.19 | 0 |
| body_fat | 18688.00 | 24.28 | 8.60 | 2.00 | 20.00 | 25.00 | 30.00 | 80.00 | 73.93 | 0.66 | 0.40 | 0 |
| points | 18688.00 | 19478.15 | 93727.46 | 0.00 | 0.00 | 100.00 | 5047.00 | 2749450.00 | 8784837124.08 | 13.16 | 251.82 | 0 |
| best_weekly_streak | 18688.00 | 0.85 | 3.21 | 0.00 | 0.00 | 0.00 | 0.00 | 49.00 | 10.29 | 7.12 | 66.85 | 0 |
| total_sessions | 3640.00 | 18.79 | 35.61 | 1.00 | 2.00 | 5.00 | 19.00 | 922.00 | 1268.12 | 7.26 | 128.14 | 15048 |
| total_time | 3640.00 | 23281.25 | 45236.45 | 0.00 | 1539.50 | 5115.50 | 21869.00 | 622509.00 | 2046336681.33 | 3.94 | 23.48 | 15048 |
| kcal_per_session | 3640.00 | 48.99 | 144.33 | 0.00 | 5.15 | 24.08 | 68.00 | 4147.00 | 20830.65 | 19.22 | 461.92 | 15048 |
| reps_per_session | 3640.00 | 10355.27 | 575130.19 | 0.00 | 11.00 | 45.00 | 124.00 | 34597012.00 | 330774735472.08 | 59.82 | 3597.13 | 15048 |
| BMI | 18522.00 | 24.92 | 4.47 | 0.27 | 22.05 | 24.22 | 26.87 | 87.62 | 19.97 | 1.54 | 7.54 | 166 |
There are 18688 users in the table, which means that 18688 people installed the application and signed up.

Median height is 171 cm (IQR 164-178) and mean height is 169.67 cm (SD 23.09); the minimum of 0 cm and the maximum of 1780 cm are most likely data-entry errors. Median weight is 72 kg (IQR 62-82) and mean weight is 73.16 kg (SD 15.84), with a minimum of 22 kg and a maximum of 277 kg. Median and mean body fat are 25% (IQR 20%-30%) and 24.28% (SD 8.60) respectively, with a minimum of 2% and a maximum of 80%. Median BMI is 24.22 (IQR 22.05-26.87), which falls in the normal weight group; mean BMI is 24.92 (SD 4.47), which rounds to 25, the boundary of the overweight group. The minimum BMI of 0.27 and the maximum of 87.62 are probably user mistakes (extreme outliers).

Median points is 100 (IQR 0-5047), the mean is 19478.15 (SD 93727.46) and the maximum is 2749450. The maximum best_weekly_streak is 49 weeks, the median is 0 (IQR 0-0) and the mean is 0.85 (SD 3.21). Median and mean total_sessions are 5 (IQR 2-19) and 18.79 (SD 35.61) respectively, with a maximum of 922 sessions. Median total_time (the unit is not confirmed, possibly minutes) is 5115.5 (IQR 1539.5-21869), the mean is 23281.25 (SD 45236.45), and the minimum and maximum are 0 and 622509. The per-user average of kilocalories burned per session has a median of 24.08 kcal (IQR 5.15-68) and a mean of 48.99 kcal (SD 144.33). The per-user average number of reps per session has a median of 45 (IQR 11-124), a mean of 10355.27 (SD 575130.19) and a maximum of 34597012, so there are extreme outliers here.

The data contains many outliers; some of them may be human errors made while entering the data.
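A summary table of this shape can be assembled from `describe()` plus a few extra columns. The sketch below is an illustration under assumptions: the helper name `summary_table` and the toy values are hypothetical, and `num_table` is assumed to hold only numeric columns.

```python
import pandas as pd

def summary_table(num_table: pd.DataFrame) -> pd.DataFrame:
    """describe() extended with variance, skewness, kurtosis and NULL count."""
    stats = num_table.describe().T            # count, mean, std, min, quartiles, max
    stats["var"] = num_table.var()
    stats["skewness"] = num_table.skew()
    stats["kurtosis"] = num_table.kurt()      # excess kurtosis (0 for a normal distribution)
    stats["NULL count"] = num_table.isna().sum()
    return stats.round(2)

# Toy data with one implausible height and one missing session count.
toy = pd.DataFrame({
    "height": [160.0, 171.0, 178.0, 1780.0],
    "total_sessions": [2.0, None, 5.0, 19.0],
})
print(summary_table(toy))
```

Note that pandas skips NULLs when computing each statistic, so `count` plus `NULL count` equals the number of rows.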
To check whether the continuous data is normally distributed, histograms, QQ plots and the Shapiro-Wilk test were used. All of them are given below.
[Figure: Histogram plots for all numeric variables]
[Figure: QQ plots for all numeric variables]
| | height | weight | body_fat | points | best_weekly_streak | total_sessions | total_time | kcal_per_session | reps_per_session | BMI |
|---|---|---|---|---|---|---|---|---|---|---|
| W | 0.35 | 0.94 | 0.95 | 0.19 | 0.28 | 0.51 | 0.54 | 0.20 | 0.01 | 0.91 |
| pval | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| normal | False | False | False | False | False | False | False | False | False | False |
None of the continuous variables is normally distributed. Judging from the plots, body fat is the only variable that might be normal, but the Shapiro-Wilk test rejects normality for it as well.
The skewness and kurtosis values in the summary statistics also point away from a normal distribution.
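A W / pval / normal table like the one above can be built with `scipy.stats.shapiro`. This is a hedged sketch: the helper name `shapiro_table`, the significance level of 0.05 and the simulated data are assumptions, not taken from the notebook.

```python
import numpy as np
import pandas as pd
from scipy import stats

def shapiro_table(num_table: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame:
    """Shapiro-Wilk W, p-value and a normality flag for every column."""
    rows = {}
    for col in num_table.columns:
        values = num_table[col].dropna()      # the test does not accept NULLs
        w, pval = stats.shapiro(values)
        rows[col] = {
            "W": round(float(w), 2),
            "pval": round(float(pval), 2),
            "normal": bool(pval > alpha),     # True = normality not rejected
        }
    return pd.DataFrame(rows)

# Simulated columns: one symmetric, one strongly right-skewed.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "roughly_normal": rng.normal(170, 9, size=500),
    "right_skewed": rng.exponential(20, size=500),
})
print(shapiro_table(toy))
```

The SciPy documentation notes that for samples larger than 5000 the W statistic is accurate but the p-value may not be, so with 18688 rows the plots and the skewness and kurtosis values are a useful cross-check.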
Reminder:
Kurtosis:
Skewness:
It is also possible to check which theoretical distribution the data comes from (or is closest to). Here the distfit function from the distfit package is used, and every variable is checked separately. The criterion for the best fit is the RSS (residual sum of squares), which measures the deviation of the fitted values from the actual empirical values. A small RSS indicates a tight fit of the model to the data. RSS is computed by

$$ RSS = \sum_{i=1}^{n} \left(y_i - f(x_i)\right)^2 $$

where $y_i$ is the i-th value of the variable to be predicted, $x_i$ is the i-th value of the explanatory variable, and $f(x_i)$ is the predicted value of $y_i$ (also termed $\hat{y}_i$). (Source: https://erdogant.github.io/distfit/pages/html/Parametric.html) The top $5$ fits for each variable are shown on a plot together with their RSS values.
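To make the RSS criterion concrete, the sketch below fits a few SciPy distributions and scores each one against the histogram of the data. It illustrates the formula only and is not distfit's implementation: the helper `rss_for`, the bin count, the candidate list and the simulated sample are assumptions. Here $y_i$ is the histogram density of bin $i$ and $f(x_i)$ is the fitted density at the bin centre.

```python
import numpy as np
from scipy import stats

def rss_for(dist, data, bins=50):
    """Fit `dist` by maximum likelihood and return the RSS against the histogram."""
    density, edges = np.histogram(data, bins=bins, density=True)   # y_i
    centers = (edges[:-1] + edges[1:]) / 2                         # x_i
    params = dist.fit(data)
    fitted = dist.pdf(centers, *params)                            # f(x_i)
    return float(np.sum((density - fitted) ** 2))

# Simulated height-like sample (not the real data).
rng = np.random.default_rng(1)
sample = rng.normal(loc=170, scale=9, size=2000)

candidates = {"norm": stats.norm, "expon": stats.expon, "logistic": stats.logistic}
scores = {name: rss_for(dist, sample) for name, dist in candidates.items()}
best = min(scores, key=scores.get)   # lowest RSS wins
print(scores, best)
```

Because the densities being compared depend on the scale of the variable, RSS values are only comparable between distributions fitted to the same variable, not across variables.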
distfit output (top 5 fits):

| Distribution | Fit time (s) | RSS | loc | scale |
|---|---|---|---|---|
| exponnorm | 0.16 | 4.42371e-06 | 164.793 | 18.300 |
| t | 0.82 | 1.6207e-05 | 168.602 | 20.232 |
| hypsecant | 0.05 | 1.75498e-05 | 170.913 | 7.824 |
| betaprime | 0.47 | 1.7312e-05 | -704.847 | 2062.608 |
| logistic | 0.01 | 1.81518e-05 | 170.845 | 6.554 |

[Figure: RSS per distribution name (lower is better). Best fit: exponnorm]
distfit output (top 5 fits):

| Distribution | Fit time (s) | RSS | loc | scale |
|---|---|---|---|---|
| exponnorm | 0.12 | 4.89187e-05 | 61.338 | 10.128 |
| gengamma | 0.85 | 4.95912e-05 | 7.526 | 0.225 |
| t | 0.83 | 5.2237e-05 | 72.151 | 13.086 |
| logistic | 0.00 | 5.3337e-05 | 72.211 | 8.539 |
| tukeylambda | 14.5 | 5.35242e-05 | 72.272 | 8.754 |

[Figure: RSS per distribution name (lower is better). Best fit: exponnorm]
distfit output (top 5 fits):

| Distribution | Fit time (s) | RSS | loc | scale |
|---|---|---|---|---|
| dgamma | 0.05 | 0.0403587 | 22.782 | 3.615 |
| dweibull | 0.08 | 0.0414598 | 23.232 | 7.570 |
| genlogistic | 0.13 | 0.0419564 | 5.779 | 6.750 |
| invweibull | 0.73 | 0.0419672 | -586064097.315 | 586064117.542 |
| gumbel_r | 0.00 | 0.0419726 | 20.249 | 7.190 |

[Figure: RSS per distribution name (lower is better). Best fit: dgamma]
distfit output (top 5 fits):

| Distribution | Fit time (s) | RSS | loc | scale |
|---|---|---|---|---|
| halflogistic | 0.16 | 5.80151e-13 | -0.000 | 17664.683 |
| genhalflogistic | 0.42 | 9.03931e-13 | -30.607 | 18800.076 |
| gompertz | 0.25 | 1.37191e-11 | -0.000 | 29654204083238240.000 |
| expon | 0.00 | 1.89989e-11 | 0.000 | 19478.150 |
| pareto | 0.01 | 1.89989e-11 | -4398046511103.998 | 4398046511103.998 |

[Figure: RSS per distribution name (lower is better). Best fit: halflogistic]
distfit output (top 5 fits):

| Distribution | Fit time (s) | RSS | loc | scale |
|---|---|---|---|---|
| lomax | 0.12 | 0.0319116 | -0.000 | 3.643 |
| gompertz | 0.26 | 0.0411965 | -0.000 | 5457980424427.981 |
| expon | 0.00 | 0.0431919 | 0.000 | 0.851 |
| pareto | 0.01 | 0.0431919 | -134217728.000 | 134217728.000 |
| genexpon | 1.58 | 0.0431971 | -0.000 | 1.709 |

[Figure: RSS per distribution name (lower is better). Best fit: lomax]
For the variables that contain NULL values, the distribution cannot be computed. They are fitted in a later part of this document.
In the output above, exponnorm (exponentially modified Gaussian) is the best fit twice, while dgamma, halflogistic and lomax are the best fit once each.
Outliers can be detected once a suitable definition of an outlier has been chosen and applied. This is not done in this analysis.

Example of outlier detection and deletion (taking the whole dataset into consideration), shown for height:

1. Look at the distribution plot of the height feature: `sns.distplot(num_table['height'])` (`sns.distplot` is deprecated in recent seaborn releases; `sns.histplot` is its replacement).
2. Look at the box plot of the height feature: `sns.boxplot(num_table['height'])`
3. Calculate the 99% and 1% quantiles of height: `upper_limit = num_table['height'].quantile(0.99)` and `lower_limit = num_table['height'].quantile(0.01)`
4. Apply trimming: `new_num_table = num_table[(num_table['height'] <= upper_limit) & (num_table['height'] >= lower_limit)]`
5. Compare the distribution and box plot after trimming: `sns.distplot(new_num_table['height'])` and `sns.boxplot(new_num_table['height'])`
6. Alternatively, apply capping (winsorization): `num_table['height'] = np.where(num_table['height'] >= upper_limit, upper_limit, np.where(num_table['height'] <= lower_limit, lower_limit, num_table['height']))`
7. Compare the distribution and box plot after capping: `sns.distplot(num_table['height'])` and `sns.boxplot(num_table['height'])`
The variables taken as categorical are:
The data can be inspected through the frequency tables with percentages shown below.

| Variable | Factor | Frequency | Percent | Cumulative Percent |
|---|---|---|---|---|
| Gender | female | 7771 | 41.58% | 41.58% |
| | male | 10917 | 58.42% | 100.0% |
| | Total | 18688 | 100.0% | - |
| Activity_level | very active | 2168 | 11.6% | 11.6% |
| | active | 9728 | 52.05% | 63.66% |
| | sedentary | 6792 | 36.34% | 100.0% |
| | Total | 18688 | 100.0% | - |
| Goal | lose | 8257 | 44.18% | 44.18% |
| | gain | 7838 | 41.94% | 86.12% |
| | antiaging | 2593 | 13.88% | 100.0% |
| | Total | 18688 | 100.0% | - |
| Language | en | 1245 | 6.66% | 6.66% |
| | es | 17443 | 93.34% | 100.0% |
| | Total | 18688 | 100.0% | - |
| Body_type | thin | 7653 | 40.95% | 40.95% |
| | mid | 8791 | 47.04% | 87.99% |
| | strong | 2244 | 12.01% | 100.0% |
| | Total | 18688 | 100.0% | - |
| BMI_category | Normal | 10528 | 56.84% | 56.84% |
| | Obesity | 2145 | 11.58% | 68.42% |
| | Overweight | 5336 | 28.81% | 97.23% |
| | Underweight | 513 | 2.77% | 100.0% |
| | Total | 18522 | 100.0% | - |
The frequency tables show that these categorical variables contain no NULLs, except BMI_category, whose total is 18522 rather than 18688 (166 NULLs). Women make up 42% and men 58% of the user population. Over half of the users set their activity level to active (52%); far fewer chose sedentary (36%) or very active (12%). Similar shares of users chose losing weight (44%) and gaining weight (42%) as their goal, while the smallest group (14%) chose the antiaging goal. 93% of users chose Spanish and 7% chose English. Most users describe their body type as mid (47%), followed by thin (41%) and strong (12%). Almost 57% of users have a normal weight, 29% are overweight, 12% are obese and 3% are underweight.
| | Total | ES | US | AR | MX | CL | DE | GB | FR | CO | ... | HR | LB | KG | DZ | ET | EU | RS | GG | HU | LU |
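Frequency, percent and cumulative percent for one categorical variable can be computed as in the sketch below. The helper name `frequency_table` is hypothetical; the counts used to rebuild the gender column are the ones reported in the table above.

```python
import pandas as pd

def frequency_table(series: pd.Series) -> pd.DataFrame:
    """Counts, percentages and cumulative percentages of a categorical series."""
    counts = series.value_counts(sort=False)   # NULLs are excluded by default
    percent = counts / counts.sum() * 100
    return pd.DataFrame({
        "Frequency": counts,
        "Percent": percent.round(2),
        "Cumulative Percent": percent.cumsum().round(2),
    })

# Gender column rebuilt from the reported counts (7771 female, 10917 male).
gender = pd.Series(["female"] * 7771 + ["male"] * 10917, dtype="category")
print(frequency_table(gender))
```

Because `value_counts` drops NULLs, the percentages are relative to the valid observations, which is why a variable with missing values has a total below 18688.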
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Frequency | 6352 | 4866 | 273 | 219 | 186 | 136 | 69 | 67 | 64 | 50 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Percent | 100.0% | 76.61% | 4.3% | 3.45% | 2.93% | 2.14% | 1.09% | 1.05% | 1.01% | 0.79% | ... | 0.02% | 0.02% | 0.02% | 0.02% | 0.02% | 0.02% | 0.02% | 0.02% | 0.02% | 0.02% |
2 rows × 80 columns
Among the users who chose a country, Spain is by far the most frequent (77%), followed at a large distance by the USA (4%) and Argentina (3%). This is consistent with so many users choosing Spanish as the main language of the app. The country is missing for most users: the frequency total is 6352 out of 18688, which means that only 34% of users chose a country of residence.
| | Total | endika | mariapelazas | fitness_revolucionario | mammothhunters | lifestyle_con_blanca | keto_aove | gloria_martinez | cristinamanyer | martina_ferrer_ | ... | nicotononpt | pablo_kuhnert | maria_mendoza_a | Anavb87 | lilifitme | janetgzzl | fullmusculo | eat2winmedia | anabel_freyes | healthybyjane |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Frequency | 867 | 271 | 108 | 97 | 83 | 77 | 53 | 44 | 37 | 23 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Percent | 100.0% | 31.26% | 12.46% | 11.19% | 9.57% | 8.88% | 6.11% | 5.07% | 4.27% | 2.65% | ... | 0.12% | 0.12% | 0.12% | 0.12% | 0.12% | 0.12% | 0.12% | 0.12% | 0.12% | 0.12% |
2 rows × 28 columns
Only 867 users (5% of all users) used an affiliate code to sign up. The most frequent codes were endika (31%), mariapelazas (12%) and fitness_revolucionario (11%).
The variables taken as boolean are:
After converting the boolean variables to categorical values, the data can be inspected through the frequency tables with percentages shown below.

| Variable | Factor | Frequency | Percent |
|---|---|---|---|
| scientific_data_usage | False | 12830 | 68.65% |
| | True | 5858 | 31.35% |
| | Total | 18688 | 100.0% |
| newsletter_subscription | False | 5230 | 27.99% |
| | True | 13458 | 72.01% |
| | Total | 18688 | 100.0% |
| notifications_setting | False | 107 | 0.57% |
| | True | 18581 | 99.43% |
| | Total | 18688 | 100.0% |
| training_days_setting | True | 18688 | 100.0% |
| | Total | 18688 | 100.0% |
According to the boolean data, only 31% of users agreed to the scientific use of their data. 72% of users subscribed to the newsletter, 99% enabled notifications, and all users have training_days_setting set to True.
Only 3621 observations have valid (non-NULL) values in all numeric variables. Within this subset there are only 1098 valid country observations, 2308 valid current_sign_in_at and last_sign_in_at observations, and 115 valid affiliate_code_signup observations.
Below is a bar plot with the count of NULL values in every numeric variable, together with information about every variable (type, non-NULL count).
<class 'pandas.core.frame.DataFrame'>, Int64Index: 3621 entries, 1 to 18660. Data columns (total 31 columns):

| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | id | 3621 | category |
| 1 | created_at | 3621 | datetime64[ns] |
| 2 | updated_at | 3621 | datetime64[ns] |
| 3 | gender | 3621 | category |
| 4 | date_of_birth | 3621 | datetime64[ns] |
| 5 | height | 3621 | float64 |
| 6 | weight | 3621 | float64 |
| 7 | activity_level | 3621 | category |
| 8 | goal | 3621 | category |
| 9 | body_type | 3621 | category |
| 10 | body_fat | 3621 | float64 |
| 11 | newsletter_subscription | 3621 | bool |
| 12 | sign_in_count | 3621 | int64 |
| 13 | current_sign_in_at | 2308 | datetime64[ns] |
| 14 | last_sign_in_at | 2308 | datetime64[ns] |
| 15 | notifications_setting | 3621 | bool |
| 16 | training_days_setting | 3621 | bool |
| 17 | language | 3621 | category |
| 18 | country | 1098 | category |
| 19 | points | 3621 | int64 |
| 20 | scientific_data_usage | 3621 | bool |
| 21 | best_weekly_streak | 3621 | int64 |
| 22 | current_weekly_streak | 3621 | int64 |
| 23 | affiliate_code_signup | 115 | category |
| 24 | total_sessions | 3621 | float64 |
| 25 | total_time | 3621 | float64 |
| 26 | kcal_per_session | 3621 | float64 |
| 27 | reps_per_session | 3621 | float64 |
| 28 | height[m] | 3621 | float64 |
| 29 | BMI | 3621 | float64 |
| 30 | BMI_category | 3621 | category |

dtypes: bool(4), category(9), datetime64[ns](5), float64(9), int64(4); memory usage: 1.2 MB
The variables taken as numerical data are:
In the table with numerical data, NULL values occur mainly in total_sessions, total_time, kcal_per_session and reps_per_session, which have only 3640 valid observations each (19.48% of all observations). The subset analysed below has 3621 rows, presumably because 19 of these users also lack a valid BMI. Restricting the analysis to valid data changes the summary statistics considerably.
| | count | mean | std | min | 25% | 50% | 75% | max | var | skewness | kurtosis | NULL count |
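Selecting the rows that are valid in every numeric variable is a complete-case filter. The sketch below shows one way to do it with `dropna`; the toy frame and the shortened column list are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: row 1 has no session data, row 3 has no valid BMI.
toy = pd.DataFrame({
    "height": [170.0, 182.0, 165.0, 0.0],
    "total_sessions": [5.0, np.nan, 12.0, 3.0],
    "total_time": [5100.0, np.nan, 20000.0, 900.0],
    "BMI": [22.5, 27.8, 24.2, np.nan],
})
numeric_cols = ["height", "total_sessions", "total_time", "BMI"]

null_counts = toy[numeric_cols].isna().sum()   # NULLs per numeric variable
valid = toy.dropna(subset=numeric_cols)        # keep rows with no NULL in any of them
share_valid = len(valid) / len(toy) * 100      # percent of rows kept

print(null_counts)
print(len(valid), share_valid)
```

`dropna` keeps the original row labels, which matches a subset whose index is no longer a contiguous range.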
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| height | 3621.00 | 171.31 | 9.22 | 142.00 | 164.00 | 171.00 | 178.00 | 221.00 | 85.01 | -0.00 | -0.24 | 0 |
| weight | 3621.00 | 71.69 | 14.51 | 40.00 | 61.00 | 71.00 | 80.00 | 277.00 | 210.55 | 1.38 | 11.57 | 0 |
| body_fat | 3621.00 | 23.70 | 8.23 | 6.00 | 20.00 | 21.00 | 30.00 | 50.00 | 67.66 | 0.63 | 0.25 | 0 |
| points | 3621.00 | 56737.84 | 167897.30 | 0.00 | 300.00 | 2100.00 | 37673.00 | 2749450.00 | 28189503440.45 | 7.00 | 71.64 | 0 |
| best_weekly_streak | 3621.00 | 4.38 | 6.13 | 1.00 | 1.00 | 2.00 | 5.00 | 49.00 | 37.62 | 3.35 | 14.18 | 0 |
| total_sessions | 3621.00 | 18.85 | 35.69 | 1.00 | 2.00 | 5.00 | 19.00 | 922.00 | 1273.88 | 7.25 | 127.59 | 0 |
| total_time | 3621.00 | 23354.22 | 45339.17 | 0.00 | 1538.00 | 5104.00 | 21914.00 | 622509.00 | 2055640595.21 | 3.93 | 23.35 | 0 |
| kcal_per_session | 3621.00 | 48.92 | 144.61 | 0.00 | 5.09 | 24.08 | 68.00 | 4147.00 | 20913.17 | 19.21 | 460.72 | 0 |
| reps_per_session | 3621.00 | 10409.11 | 576637.05 | 0.00 | 11.00 | 45.00 | 125.00 | 34597012.00 | 332510290413.48 | 59.67 | 3578.36 | 0 |
| BMI | 3621.00 | 24.32 | 3.99 | 11.88 | 21.78 | 23.68 | 26.03 | 87.62 | 15.90 | 2.26 | 20.31 | 0 |
The median and IQR of height are unchanged (171 cm, IQR 164-178), but the mean rose from 169.67 cm to 171.31 cm and the SD decreased from 23.09 to 9.22. The maximum height decreased from 1780 cm to 221 cm and the minimum rose from 0 cm to 142 cm. The mean weight decreased from 73.16 kg to 71.69 kg (SD from 15.84 to 14.51) and the median from 72 kg to 71 kg; the minimum increased from 22 kg to 40 kg. The mean body_fat decreased from 24.28% to 23.70% (SD from 8.60 to 8.23), the median decreased by 4 percentage points (from 25% to 21%) and the maximum by 30 percentage points (from 80% to 50%). Median BMI stayed at about 24, in the normal weight group (23.68, IQR 21.78-26.03), the minimum increased from 0.27 (probably a user mistake) to 11.88, the mean decreased from 24.92 (SD 4.47) to 24.32 (SD 3.99), and the maximum stayed at 87.62 (probably also a user mistake, an extreme outlier).

For points, everything increased except the minimum and maximum, which stayed the same: the mean went from 19478.15 to 56737.84, the SD from 93727.46 to 167897.30, and the median from 100 (IQR 0-5047) to 2100 (IQR 300-37673). The maximum best_weekly_streak stayed at 49 weeks, the median increased from 0 (IQR 0-0) to 2 (IQR 1-5) and the mean from 0.85 (SD 3.21) to 4.38 (SD 6.13).

The median of total_sessions did not change and is 5 (IQR 2-19); the mean increased from 18.79 (SD 35.61) to 18.85 (SD 35.69) and the maximum stayed at 922 sessions. Median total_time (the unit is not confirmed, possibly minutes) decreased from 5115.5 (IQR 1539.5-21869) to 5104 (IQR 1538-21914), the mean increased from 23281.25 (SD 45236.45) to 23354.22 (SD 45339.17), and the minimum and maximum stayed at 0 and 622509. The per-user average of kilocalories burned per session stayed at a median of 24 kcal (IQR 5-68) and a mean of 49 kcal (SD 145). The per-user average number of reps per session stayed at a median of 45 (IQR 11-125), the maximum stayed at 34597012, and the mean increased from 10355.27 (SD 575130.19) to 10409.11 (SD 576637.05), so the extreme outliers remain.
The normality of this subset of data is checked by the same method as previously.
[Figure: Histogram plots for all numeric variables without NULLs]
[Figure: QQ plots for all numeric variables without NULLs]
| | height | weight | body_fat | points | best_weekly_streak | total_sessions | total_time | kcal_per_session | reps_per_session | BMI |
|---|---|---|---|---|---|---|---|---|---|---|
| W | 0.99 | 0.95 | 0.94 | 0.36 | 0.59 | 0.51 | 0.54 | 0.19 | 0.01 | 0.89 |
| pval | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| normal | False | False | False | False | False | False | False | False | False | False |
As before, the Shapiro-Wilk test rejects normality for every variable, even when observations with NULL values are omitted. For most variables, skewness and kurtosis also indicate non-normality; height is the exception, with skewness and kurtosis close to 0 and W = 0.99.
Again, the distribution is checked for every variable, with goodness of fit judged by RSS. The top $5$ fits for each variable are shown on a plot together with their RSS values.
distfit output (top 5 fits):

| Distribution | Fit time (s) | RSS | loc | scale |
|---|---|---|---|---|
| loggamma | 0.08 | 0.00346306 | -1688.805 | 274.301 |
| chi | 0.09 | 0.0222726 | 142.000 | 2.558 |
| johnsonsb | 0.40 | 0.00347174 | -7155.538 | 10180.728 |
| powernorm | 0.12 | 0.00347359 | 170.358 | 8.902 |
| logistic | 0.00 | 0.00381464 | 171.365 | 5.378 |

[Figure: RSS per distribution name (lower is better). Best fit: loggamma]
distfit output (top 5 fits):

| Distribution | Fit time (s) | RSS | loc | scale |
|---|---|---|---|---|
| f | 0.13 | 6.39223e-05 | -4.684 | 74.088 |
| lognorm | 0.13 | 6.37886e-05 | 11.428 | 58.634 |
| maxwell | 0.01 | 5.45636e-05 | 37.901 | 21.233 |
| betaprime | 0.16 | 6.0882e-05 | -0.851 | 25.570 |
| erlang | 0.05 | 6.11575e-05 | 29.325 | 4.816 |

[Figure: RSS per distribution name (lower is better). Best fit: maxwell]
distfit output (top 5 fits):

| Distribution | Fit time (s) | RSS | loc | scale |
|---|---|---|---|---|
| dgamma | 0.02 | 0.167269 | 22.741 | 3.269 |
| dweibull | 0.03 | 0.169236 | 22.851 | 7.322 |
| gumbel_r | 0.00 | 0.171813 | 19.832 | 6.896 |
| invweibull | 0.18 | 0.171811 | -625523837.233 | 625523857.082 |
| genlogistic | 0.05 | 0.171884 | 6.090 | 6.470 |

[Figure: RSS per distribution name (lower is better). Best fit: dgamma]
distfit output (top 5 fits):

| Distribution | Fit time (s) | RSS | loc | scale |
|---|---|---|---|---|
| wald | 0.02 | 1.26588e-11 | -11170.683 | 44342.275 |
| exponnorm | 0.14 | 2.02968e-11 | 3.789 | 22.622 |
| expon | 0.00 | 2.04285e-11 | 0.000 | 56737.837 |
| genexpon | 1.37 | 2.04285e-11 | -0.000 | 110178.329 |
| gilbrat | 0.04 | 2.89726e-11 | -3827.821 | 13823.906 |

[Figure: RSS per distribution name (lower is better). Best fit: wald]
distfit output (top 5 fits):

| Distribution | Fit time (s) | RSS | loc | scale |
|---|---|---|---|---|
| gengamma | 0.28 | 0.00725802 | 1.000 | 0.411 |
| pearson3 | 0.20 | 0.00247779 | 2.608 | 2.146 |
| gilbrat | 0.02 | 0.0078055 | 0.555 | 1.605 |
| burr | 0.34 | 0.23682 | 1.000 | 0.000 |
| alpha | 0.04 | 0.0121778 | 0.671 | 0.501 |

[Figure: RSS per distribution name (lower is better). Best fit: pearson3]
distfit output (top 5 fits):

| Distribution | Fit time (s) | RSS | loc | scale |
|---|---|---|---|---|
| pearson3 | 0.21 | 4.58683e-05 | 10.929 | 11.176 |
| gilbrat | 0.02 | 4.49006e-05 | -0.776 | 7.945 |
| wald | 0.01 | 4.75319e-05 | -2.523 | 16.480 |
| exponnorm | 0.11 | 0.000100346 | 0.985 | 0.005 |
| genexpon | 1.41 | 0.000100512 | 1.000 | 27.758 |

[Figure: RSS per distribution name (lower is better). Best fit: gilbrat]
distfit output (top 5 fits):

| Distribution | Fit time (s) | RSS | loc | scale |
|---|---|---|---|---|
| halfcauchy | 0.06 | 1.35239e-11 | -0.000 | 5332.925 |
| cauchy | 0.01 | 4.21087e-11 | 3171.441 | 3987.686 |
| gilbrat | 0.03 | 5.50217e-11 | -1682.041 | 9469.309 |
| beta | 0.16 | 1.30578e-10 | -0.000 | 112625113.232 |
| wald | 0.02 | 1.76875e-10 | -3942.148 | 20442.556 |

[Figure: RSS per distribution name (lower is better). Best fit: halfcauchy]
distfit output (top 5 fits):

| Distribution | Fit time (s) | RSS | loc | scale |
|---|---|---|---|---|
| halflogistic | 0.04 | 1.58525e-07 | -0.000 | 37.574 |
| gumbel_r | 0.00 | 5.16349e-07 | 24.265 | 34.796 |
| genlogistic | 0.11 | 5.41946e-07 | -229.704 | 34.830 |
| t | 0.22 | 8.07242e-07 | 31.008 | 32.174 |
| hypsecant | 0.01 | 1.77855e-06 | 33.612 | 34.775 |

[Figure: RSS per distribution name (lower is better). Best fit: halflogistic]
distfit output (top 5 fits):

| Distribution | Fit time (s) | RSS | loc | scale |
|---|---|---|---|---|
| f | 0.13 | 6.39223e-05 | -4.684 | 74.088 |
| lognorm | 0.13 | 6.37886e-05 | 11.428 | 58.634 |
| maxwell | 0.01 | 5.45636e-05 | 37.901 | 21.233 |
| betaprime | 0.15 | 6.0882e-05 | -0.851 | 25.570 |
| erlang | 0.05 | 6.11575e-05 | 29.325 | 4.816 |

[Figure: RSS per distribution name (lower is better). Best fit: maxwell]
distfit output (top 5 fits):

| Distribution | Fit time (s) | RSS | loc | scale |
|---|---|---|---|---|
| fisk | 0.23 | 0.000320445 | 10.885 | 12.877 |
| exponnorm | 0.03 | 0.000405039 | 21.083 | 2.038 |
| burr | 0.23 | 0.000410933 | -0.111 | 21.703 |
| mielke | 0.17 | 0.000411179 | -0.171 | 21.757 |
| johnsonsu | 0.28 | 0.000503498 | 20.226 | 4.310 |

[Figure: RSS per distribution name (lower is better). Best fit: fisk]
In the output above, maxwell is the best fit twice, while loggamma, dgamma, wald, pearson3, gilbrat, halfcauchy, halflogistic and fisk are the best fit once each.
This section will be done later.
The variables taken as categorical are:
The data can be inspected through the frequency tables with percentages shown below.

| Variable | Factor | Frequency | Percent | Cumulative Percent |
|---|---|---|---|---|
| Gender | female | 1411 | 38.97% | 38.97% |
| | male | 2210 | 61.03% | 100.0% |
| | Total | 3621 | 100.0% | - |
| Activity_level | very active | 348 | 9.61% | 9.61% |
| | active | 2024 | 55.9% | 65.51% |
| | sedentary | 1249 | 34.49% | 100.0% |
| | Total | 3621 | 100.0% | - |
| Goal | lose | 1467 | 40.51% | 40.51% |
| | gain | 1639 | 45.26% | 85.78% |
| | antiaging | 515 | 14.22% | 100.0% |
| | Total | 3621 | 100.0% | - |
| Language | en | 179 | 4.94% | 4.94% |
| | es | 3442 | 95.06% | 100.0% |
| | Total | 3621 | 100.0% | - |
| Body_type | thin | 1586 | 43.8% | 43.8% |
| | mid | 1692 | 46.73% | 90.53% |
| | strong | 343 | 9.47% | 100.0% |
| | Total | 3621 | 100.0% | - |
| BMI_category | Normal | 2300 | 63.52% | 63.52% |
| | Obesity | 295 | 8.15% | 71.67% |
| | Overweight | 937 | 25.88% | 97.54% |
| | Underweight | 89 | 2.46% | 100.0% |
| | Total | 3621 | 100.0% | - |
The data show a disproportion between men, 2210 (61%), and women, 1411 (39%). The largest activity-level group in this subset is active with 2024 (56%) observations, followed by sedentary with 1249 (34%) and very active with 348 (10%). The whole dataset has the same order, with 52%, 36% and 12% of all users respectively. For the goal variable, the largest group in this subset is gain with 1639 (45%) observations, then lose with 1467 (41%) and antiaging with 515 (14%). In the whole dataset the largest group is lose with 8257 (44%) observations, then gain with 7838 (42%) and antiaging with 2593 (14%). About 95% of the users in this subset (3442 observations) chose Spanish as their app language and 5% (179) chose English. The most frequently chosen body type is mid with 1692 (47%) observations, then thin with 1586 (44%), and the smallest group is strong with 343 (9%). By BMI category, the largest group has normal weight, 2300 (64%), followed by overweight, 937 (26%), obesity, 295 (8%), and underweight, 89 (2%).
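A frequency table with percent and cumulative percent columns, like the one above, could be produced along these lines. This is a minimal sketch on invented toy data; the helper name `frequency_table` is hypothetical and is not taken from the original notebook.

```python
import pandas as pd


def frequency_table(series: pd.Series) -> pd.DataFrame:
    """Counts, percent and cumulative percent for one categorical column."""
    # sort=False keeps the category order instead of sorting by count.
    counts = series.value_counts(sort=False, dropna=True)
    percent = 100 * counts / counts.sum()
    return pd.DataFrame({
        "Frequency": counts,
        "Percent": percent.round(2),
        "Cumulative Percent": percent.cumsum().round(2),
    })


# Toy data only; these values are not taken from the users table.
gender = pd.Series(["female", "male", "male", "female", "male"], dtype="category")
table = frequency_table(gender)
print(table)
```

Computing the percentages from the same counts that are printed keeps the table and any prose summary derived from it in agreement.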
| | Total | ES | AR | MX | US | CL | FR | CO | DE | CH | ... | JP | KG | LB | LT | LU | MA | ML | MY | NI | JM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Frequency | 1098 | 874 | 40 | 33 | 26 | 15 | 13 | 11 | 11 | 9 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Percent | 100.0% | 79.6% | 3.64% | 3.01% | 2.37% | 1.37% | 1.18% | 1.0% | 1.0% | 0.82% | ... | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
2 rows × 80 columns
As before, most of the people who shared their country were from Spain, but the second largest group is now from Argentina (previously it was the USA). Next, the affiliate code.
| | Total | endika | fitness_revolucionario | lifestyle_con_blanca | mammothhunters | mariapelazas | cristinamanyer | martina_ferrer_ | keto_aove | MyHixel | ... | Anavb87 | janetgzzl | MerakiFit | gloriaalcalar | gloria_martinez | fullmusculo | eat2winmedia | dracaminodiaz | anabel_freyes | healthybyjane |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Frequency | 115 | 25 | 19 | 19 | 16 | 14 | 6 | 6 | 4 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Percent | 100.0% | 21.74% | 16.52% | 16.52% | 13.91% | 12.17% | 5.22% | 5.22% | 3.48% | 1.74% | ... | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
2 rows × 28 columns
As before, the most frequently used affiliate code is endika, but the second most used code is now fitness_revolucionario (tied with lifestyle_con_blanca at 19 uses each), whereas previously it was mariapelazas. The number of valid observations decreased from 867 to 115.
The variables taken as boolean are:
After converting the boolean variables to categorical values, the frequency tables below show their counts and percentages.
| Variable | Factor | Frequency | Percent |
|---|---|---|---|
| scientific_data_usage | False | 2228.00 | 61.53% |
| | True | 1393.00 | 38.47% |
| | Total | 3621.00 | 100.0% |
| newsletter_subscription | False | 1085.00 | 29.96% |
| | True | 2536.00 | 70.04% |
| | Total | 3621.00 | 100.0% |
| notifications_setting | False | 69.00 | 1.91% |
| | True | 3552.00 | 98.09% |
| | Total | 3621.00 | 100.0% |
| training_days_setting | True | 3621.00 | 100.0% |
| | Total | 3621.00 | 100.0% |
The number of users who agreed to scientific data usage decreased from 5858 to 1393. Agreement to scientific_data_usage now covers 38% of non-NULL observations. 2536 (70%) people signed up for newsletter_subscription. The number of people who turned on notifications_setting is 3552 (98%). All of the users turned on training_days_setting.
Considering only the users who agreed to scientific usage of their data, a similar analysis can be prepared.
A summary of the number of NULLs and the data types is given below. In total, there are 5858 observations.
<class 'pandas.core.frame.DataFrame'> Int64Index: 5858 entries, 5 to 18687. Data columns (total 27 columns):

| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | id | 5858 non-null | category |
| 1 | created_at | 5858 non-null | datetime64[ns] |
| 2 | updated_at | 5858 non-null | datetime64[ns] |
| 3 | gender | 5858 non-null | category |
| 4 | date_of_birth | 5858 non-null | datetime64[ns] |
| 5 | height | 5858 non-null | float64 |
| 6 | weight | 5858 non-null | float64 |
| 7 | activity_level | 5858 non-null | category |
| 8 | goal | 5858 non-null | category |
| 9 | body_type | 5858 non-null | category |
| 10 | body_fat | 5858 non-null | float64 |
| 11 | newsletter_subscription | 5858 non-null | bool |
| 12 | notifications_setting | 5858 non-null | bool |
| 13 | training_days_setting | 5858 non-null | bool |
| 14 | language | 5858 non-null | category |
| 15 | country | 215 non-null | category |
| 16 | points | 5858 non-null | int64 |
| 17 | scientific_data_usage | 5858 non-null | category |
| 18 | best_weekly_streak | 5858 non-null | int64 |
| 19 | affiliate_code_signup | 13 non-null | category |
| 20 | total_sessions | 1395 non-null | float64 |
| 21 | total_time | 1395 non-null | float64 |
| 22 | kcal_per_session | 1395 non-null | float64 |
| 23 | reps_per_session | 1395 non-null | float64 |
| 24 | height[m] | 5858 non-null | float64 |
| 25 | BMI | 5853 non-null | float64 |
| 26 | BMI_category | 5853 non-null | category |

dtypes: bool(3), category(10), datetime64[ns](3), float64(9), int64(2). Memory usage: 1.4 MB
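Selecting the users who agreed to scientific data usage and summarising NULL counts and data types, as in the overview above, might look roughly like this. The toy frame and its values are invented, and `scientific_data_usage` is assumed here to be a plain boolean column (in the report it is stored as a category).

```python
import numpy as np
import pandas as pd

# Toy frame reusing column names from the users table; the values are invented.
users = pd.DataFrame({
    "scientific_data_usage": [True, False, True, True],
    "height": [170.0, 165.0, 180.0, 158.0],
    "total_sessions": [5.0, np.nan, np.nan, 12.0],
})

# Keep only users who agreed to scientific use of their data.
subset = users[users["scientific_data_usage"]]

# Non-NULL count, NULL count and dtype per column, similar to DataFrame.info().
overview = pd.DataFrame({
    "non_null": subset.notna().sum(),
    "null": subset.isna().sum(),
    "dtype": subset.dtypes.astype(str),
})
print(overview)
```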
The variables taken into consideration as numerical data will be the same as before:
A table with summary statistics (mean, standard deviation, minimum, maximum, quartiles, variance, skewness, kurtosis and NULL count) is given below. The values may differ from the ones in the first table.
| | count | mean | std | min | 25% | 50% | 75% | max | var | skewness | kurtosis | NULL count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| height | 5858.00 | 168.78 | 10.60 | 1.00 | 162.00 | 169.00 | 175.26 | 236.22 | 112.40 | -3.03 | 48.42 | 0 |
| weight | 5858.00 | 72.06 | 16.63 | 38.55 | 60.00 | 70.00 | 81.00 | 277.00 | 276.72 | 1.29 | 6.29 | 0 |
| body_fat | 5858.00 | 24.93 | 8.62 | 6.60 | 20.00 | 25.00 | 30.00 | 50.00 | 74.24 | 0.60 | 0.08 | 0 |
| points | 5858.00 | 8239.90 | 67324.18 | 0.00 | 0.00 | 0.00 | 0.00 | 2463230.00 | 4532545127.76 | 16.21 | 397.37 | 0 |
| best_weekly_streak | 5858.00 | 0.94 | 3.45 | 0.00 | 0.00 | 0.00 | 0.00 | 49.00 | 11.88 | 7.51 | 72.43 | 0 |
| total_sessions | 1395.00 | 16.25 | 30.63 | 1.00 | 2.00 | 5.00 | 15.00 | 274.00 | 938.37 | 3.59 | 16.11 | 4463 |
| total_time | 1395.00 | 20087.24 | 43611.22 | 0.00 | 1165.00 | 3901.00 | 14918.50 | 336812.00 | 1901938878.89 | 3.82 | 17.07 | 4463 |
| kcal_per_session | 1395.00 | 52.88 | 159.27 | 0.00 | 7.01 | 31.25 | 69.00 | 4147.00 | 25366.99 | 18.33 | 405.04 | 4463 |
| reps_per_session | 1395.00 | 2001.32 | 71395.05 | 0.00 | 15.00 | 64.00 | 126.00 | 2666671.00 | 5097253462.07 | 37.35 | 1394.99 | 4463 |
| BMI | 5853.00 | 25.13 | 4.92 | 10.75 | 21.84 | 24.28 | 27.59 | 87.62 | 24.17 | 1.45 | 6.77 | 5 |
There are 5858 users in this subset of the users table. Compared to the whole users table, median height decreased from 171 cm (IQR 164 - 178) to 169 cm (IQR 162 - 175), mean height decreased from 169.67 cm (SD 23.09) to 168.78 cm (SD 10.6) and maximum height decreased from 1780 cm to 236 cm. Median weight decreased from 72 kg (IQR 62 - 82) to 70 kg (IQR 60 - 81), the minimum increased from 22 kg to 39 kg and the maximum stayed the same at 277 kg. Mean weight decreased from 73.16 kg (SD 15.84) to 72.06 kg (SD 16.63). Median body fat stayed the same at 25% (IQR 20% - 30%) and the mean in this subset is 24.93% (SD 8.62), while the minimum given body fat increased from 2% to 6.6% and the maximum decreased from 80% to 50%.

Median points decreased from 100 (IQR 0 - 5047) to 0 (IQR 0 - 0), the maximum decreased from 2749450 to 2463230 and the mean decreased from 19478.15 (SD 93727.46) to 8240 (SD 67324). The maximum best_weekly_streak stayed at 49 weeks, the median is also the same at 0 (IQR 0 - 0) and the mean increased from 0.85 (SD 3.21) to 0.94 (SD 3.45). For total_sessions the median stayed the same but the IQR changed: 5 (IQR 2 - 19) in the whole table and 5 (IQR 2 - 15) in the subset. Mean total_sessions decreased from 18.79 (SD 35.61) to 16.25 (SD 30.63) and the maximum decreased from 922 to 274 sessions. Median total_time (possibly in minutes; the unit is not confirmed) decreased from 5115.5 (IQR 1539.5 - 21869) to 3901 (IQR 1165 - 14919), the mean decreased from 23281.25 (SD 45236.45) to 20087 (SD 43611), the minimum stayed at 0 and the maximum decreased from 622509 to 336812. The median of the average kilocalories burned per session (computed per user) increased from 24.08 kcal (IQR 5.15 - 68) to 31.25 kcal (IQR 7 - 69) and the mean increased from 48.99 kcal (SD 144.33) to 52.88 kcal (SD 159). The median of the average number of reps per session (computed per user) increased from 45 (IQR 11 - 124) to 64 (IQR 15 - 126), the maximum decreased from 34597012 to 2666671 and the mean decreased from 10355.27 (SD 575130.19) to 2001.32 (SD 71395.05).

Median BMI is 24 (IQR 22 - 28), in the normal weight group, the minimum is 10.75, mean BMI is 25 (SD 5), in the overweight group, and the maximum is 88 (probably a mistake made by a user, an extreme outlier).
Among the last five variables, total_sessions, total_time, kcal_per_session and reps_per_session each have 4463 NULL observations (76% of this subset) and BMI has 5. That leaves only 1393 observations that are valid for all five of these variables.
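A summary table with the same columns as the one above could be assembled from `describe()` plus a few extra statistics. This is a sketch on invented toy data; the helper name `summary_table` is hypothetical, and it is an assumption that the report used pandas' default (bias-corrected) skewness and excess kurtosis.

```python
import numpy as np
import pandas as pd


def summary_table(df: pd.DataFrame) -> pd.DataFrame:
    """Descriptive statistics per numeric column, one row per variable."""
    stats = df.describe().T              # count, mean, std, min, quartiles, max
    stats["var"] = df.var()
    stats["skewness"] = df.skew()
    stats["kurtosis"] = df.kurt()        # excess kurtosis (0 for a normal distribution)
    stats["NULL count"] = df.isna().sum()
    return stats


# Toy data only; these values are not taken from the users table.
toy = pd.DataFrame({
    "height": [160.0, 170.0, 180.0, 175.0, 165.0],
    "total_sessions": [1.0, np.nan, 5.0, 2.0, np.nan],
})
summary = summary_table(toy)
print(summary)
```

Note that `count` reports non-NULL values only, so `count` plus `NULL count` equals the number of rows for each variable.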
Figure: Histogram plots for numeric variables of users table subset.
Figure: QQ plots for numeric variables of users table subset.
| | height | weight | body_fat | points | best_weekly_streak | total_sessions | total_time | kcal_per_session | reps_per_session | BMI |
|---|---|---|---|---|---|---|---|---|---|---|
| W | 0.86 | 0.94 | 0.94 | 0.10 | 0.29 | 0.54 | 0.49 | 0.18 | 0.01 | 0.93 |
| pval | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| normal | False | False | False | False | False | False | False | False | False | False |
From the plots above, the Shapiro-Wilk test, and the skewness and kurtosis values, we can conclude that none of the variables is normally distributed.
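The W / pval / normal table above could be reproduced with `scipy.stats.shapiro`, roughly as sketched below. The helper name `shapiro_table`, the 0.05 significance level and the toy data are assumptions, not details taken from the report. Note that SciPy's documentation warns that for samples larger than 5000 the W statistic is accurate but the p-value may not be, which is relevant for a subset of 5858 observations.

```python
import numpy as np
import pandas as pd
from scipy import stats


def shapiro_table(df: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame:
    """Shapiro-Wilk W, p-value and a normality flag for every column."""
    rows = {}
    for col in df.columns:
        values = df[col].dropna()        # the test does not accept NULLs
        w, p = stats.shapiro(values)
        rows[col] = {"W": round(w, 2), "pval": round(p, 2), "normal": p > alpha}
    # Outer keys become columns, inner keys become the rows W / pval / normal.
    return pd.DataFrame(rows)


# Toy data only: one roughly normal and one clearly skewed variable.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "normal_like": rng.normal(170, 10, size=500),
    "skewed": rng.exponential(1.0, size=500),
})
result = shapiro_table(toy)
print(result)
```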
The distribution of the data is checked for every variable. Goodness of fit is measured by the residual sum of squares (RSS). For each variable the top $5$ best fits are shown on a plot together with their RSS values.
[distfit] >fit.. [distfit] >transform.. [distfit] >[alpha ] [0.11 sec] [RSS: 8.73989e-05] [loc=-410.812 scale=35696.799] [distfit] >[pearson3 ] [0.13 sec] [RSS: 9.13647e-05] [loc=168.924 scale=9.449] [distfit] >[chi ] [0.14 sec] [RSS: 9.45143e-05] [loc=95.946 scale=13.413] [distfit] >[vonmises_line] [0.66 sec] [RSS: 0.00010522] [loc=168.854 scale=53.439] [distfit] >[exponnorm ] [0.04 sec] [RSS: 0.000113741] [loc=168.779 scale=10.601] [distfit] >Compute confidence interval [parametric] [distfit] >plot summary..
Figure: distfit RSS summary plot (x-axis: distribution name, y-axis: RSS, lower is better). Best fit: alpha.
[distfit] >fit.. [distfit] >transform.. [distfit] >[erlang ] [0.09 sec] [RSS: 5.16235e-05] [loc=31.492 scale=6.594] [distfit] >[gamma ] [0.04 sec] [RSS: 5.16236e-05] [loc=31.493 scale=6.594] [distfit] >[chi2 ] [0.09 sec] [RSS: 5.16236e-05] [loc=31.492 scale=3.297] [distfit] >[pearson3] [0.10 sec] [RSS: 5.16237e-05] [loc=72.055 scale=16.355] [distfit] >[beta ] [0.15 sec] [RSS: 5.19325e-05] [loc=31.742 scale=25056174.059] [distfit] >Compute confidence interval [parametric] [distfit] >plot summary..
Figure: distfit RSS summary plot (x-axis: distribution name, y-axis: RSS, lower is better). Best fit: erlang.
[distfit] >fit.. [distfit] >transform.. [distfit] >[dgamma ] [0.02 sec] [RSS: 0.186749] [loc=23.155 scale=3.651] [distfit] >[dweibull ] [0.03 sec] [RSS: 0.188869] [loc=23.376 scale=7.645] [distfit] >[triang ] [0.25 sec] [RSS: 0.191619] [loc=6.572 scale=44.826] [distfit] >[genlogistic] [0.05 sec] [RSS: 0.19202] [loc=7.855 scale=6.721] [distfit] >[genextreme ] [0.31 sec] [RSS: 0.192084] [loc=21.238 scale=7.440] [distfit] >Compute confidence interval [parametric] [distfit] >plot summary..
Figure: distfit RSS summary plot (x-axis: distribution name, y-axis: RSS, lower is better). Best fit: dgamma.
[distfit] >fit.. [distfit] >transform.. [distfit] >[halflogistic] [0.07 sec] [RSS: 8.39718e-11] [loc=-0.000 scale=8032.906] [distfit] >[gompertz ] [0.17 sec] [RSS: 1.11383e-10] [loc=-0.000 scale=791154496762173184.000] [distfit] >[truncnorm ] [0.28 sec] [RSS: 1.20445e-10] [loc=-535.690 scale=67750.432] [distfit] >[foldnorm ] [0.13 sec] [RSS: 1.21592e-10] [loc=-0.000 scale=67820.844] [distfit] >[halfnorm ] [0.04 sec] [RSS: 1.21592e-10] [loc=-0.000 scale=67820.847] [distfit] >Compute confidence interval [parametric] [distfit] >plot summary..
Figure: distfit RSS summary plot (x-axis: distribution name, y-axis: RSS, lower is better). Best fit: halflogistic.
[distfit] >fit.. [distfit] >transform.. [distfit] >[gompertz] [0.18 sec] [RSS: 0.0159231] [loc=-0.000 scale=459.936] [distfit] >[lomax ] [0.06 sec] [RSS: 0.0189393] [loc=-0.000 scale=3.403] [distfit] >[pearson3] [0.28 sec] [RSS: 0.0270229] [loc=0.570 scale=0.670] [distfit] >[genexpon] [1.36 sec] [RSS: 0.0356292] [loc=-0.000 scale=2.023] [distfit] >[expon ] [0.00 sec] [RSS: 0.0356374] [loc=0.000 scale=0.943] [distfit] >Compute confidence interval [parametric] [distfit] >plot summary..
Figure: distfit RSS summary plot (x-axis: distribution name, y-axis: RSS, lower is better). Best fit: gompertz.
For total_sessions, total_time, kcal_per_session, reps_per_session and BMI the distribution cannot be computed because of the NULL values. They will be fitted in a later part of this document.
Every fitted variable has a different best-fit distribution: alpha, Erlang, dgamma, half-logistic and Gompertz. The folded normal and generalized exponential distributions also appear among the top five fits.
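The idea of ranking candidate distributions by RSS can be illustrated with SciPy alone. The sketch below fits each candidate by maximum likelihood and compares its density with the histogram of the data; the helper name `rank_by_rss`, the candidate list and the number of bins are assumptions, and distfit's internal procedure may differ in its details.

```python
import numpy as np
from scipy import stats


def rank_by_rss(values, candidates=("norm", "expon", "uniform"), bins=50):
    """Fit each candidate distribution and sort by RSS (lower is better)."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]                 # drop NULLs before fitting
    density, edges = np.histogram(values, bins=bins, density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    results = []
    for name in candidates:
        dist = getattr(stats, name)
        params = dist.fit(values)                      # maximum-likelihood parameters
        pdf = dist.pdf(centers, *params)
        rss = float(np.sum((density - pdf) ** 2))      # residual sum of squares
        results.append((name, rss))
    return sorted(results, key=lambda item: item[1])


# Toy data only: a normal sample, so "norm" is expected to rank first.
rng = np.random.default_rng(42)
sample = rng.normal(loc=170, scale=10, size=2000)
ranking = rank_by_rss(sample)
for name, rss in ranking:
    print(f"{name:8s} RSS={rss:.6g}")
```

Because RSS depends on the scale of the density, values are only comparable between distributions fitted to the same variable, not across variables.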
The variables taken as categorical are:
The frequency tables below show the counts and percentages for each variable.
| Variable | Factor | Frequency | Percent | Cumulative Percent |
|---|---|---|---|---|
| Gender | female | 2981.00 | 50.89% | 50.89% |
| | male | 2877.00 | 49.11% | 100.0% |
| | Total | 5858.00 | 100.0% | - |
| Activity_level | very active | 531.00 | 9.06% | 9.06% |
| | active | 2907.00 | 49.62% | 58.69% |
| | sedentary | 2420.00 | 41.31% | 100.0% |
| | Total | 5858.00 | 100.0% | - |
| Goal | lose | 2947.00 | 50.31% | 50.31% |
| | gain | 2217.00 | 37.85% | 88.15% |
| | antiaging | 694.00 | 11.85% | 100.0% |
| | Total | 5858.00 | 100.0% | - |
| Language | en | 31.00 | 0.53% | 0.53% |
| | es | 5827.00 | 99.47% | 100.0% |
| | Total | 5858.00 | 100.0% | - |
| Body_type | thin | 2776.00 | 47.39% | 47.39% |
| | mid | 2226.00 | 38.0% | 85.39% |
| | strong | 856.00 | 14.61% | 100.0% |
| | Total | 5858.00 | 100.0% | - |
| BMI_category | Normal | 3106.00 | 53.07% | 53.07% |
| | Obesity | 846.00 | 14.45% | 67.52% |
| | Overweight | 1667.00 | 28.48% | 96.0% |
| | Underweight | 234.00 | 4.0% | 100.0% |
| | Total | 5853.00 | 100.0% | - |
The male and female groups are almost equal in size (female 51%, male 49%; in the whole dataset it was 42% female and 58% male). The largest activity-level group is active with 2907 (50%) observations, then sedentary with 2420 (41%), and the smallest is very active with 531 (9%). The whole dataset has the same order (52%, 36% and 12% respectively). Most users had the goal to lose weight, 2947 (50%), then to gain weight, 2217 (38%), and the smallest group had the antiaging goal, 694 (12%). Again, the order is the same as in the whole dataset (44%, 42% and 14% respectively). Most people chose Spanish as the app language, 5827 (99%); only 31 (1%) users chose English. In this subset most users reported a thin body type, 2776 (47%), then mid, 2226 (38%), and strong, 856 (15%). In the whole dataset the largest group is mid (47%), then thin (41%) and strong (12%). The number of people with normal BMI is 3106 (53%), followed by overweight users, 1667 (28%), people with obesity, 846 (14%), and underweight people, 234 (4%).
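The BMI and BMI_category columns used above could be derived from height and weight as sketched below. The cut-offs (18.5, 25 and 30) are the standard WHO adult classes and are an assumption here, since the report does not state which thresholds it used; the helper name `add_bmi` and the toy values are likewise hypothetical.

```python
import pandas as pd

# Assumed cut-offs (standard WHO adult classes); not confirmed by the report.
BMI_BINS = [0, 18.5, 25, 30, float("inf")]
BMI_LABELS = ["Underweight", "Normal", "Overweight", "Obesity"]


def add_bmi(df: pd.DataFrame) -> pd.DataFrame:
    """Add height[m], BMI and BMI_category columns to a copy of df."""
    out = df.copy()
    out["height[m]"] = out["height"] / 100             # height is stored in cm
    out["BMI"] = out["weight"] / out["height[m]"] ** 2
    # right=False makes the intervals [0, 18.5), [18.5, 25), [25, 30), [30, inf).
    out["BMI_category"] = pd.cut(out["BMI"], bins=BMI_BINS, labels=BMI_LABELS, right=False)
    return out


# Toy data only; these values are not taken from the users table.
toy = pd.DataFrame({
    "height": [170.0, 180.0, 160.0, 175.0],
    "weight": [50.0, 75.0, 70.0, 95.0],
})
result = add_bmi(toy)
print(result[["BMI", "BMI_category"]])
```

Implausible heights or weights entered by users propagate directly into BMI, which is one way an extreme value such as 88 can appear.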
Below is the frequency table and barplot of variable country.
| | Total | ES | AR | MX | CH | US | CA | CL | CO | AU | ... | HR | HU | IN | IT | AE | JP | KG | LB | LT | JM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Frequency | 215 | 173 | 9 | 6 | 4 | 4 | 2 | 2 | 2 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Percent | 100.0% | 80.47% | 4.19% | 2.79% | 1.86% | 1.86% | 0.93% | 0.93% | 0.93% | 0.47% | ... | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
2 rows × 80 columns
In the scientific_data_usage agreement subset only 215 users chose to give their country: 173 (80%) of them chose Spain, 9 (4%) chose Argentina and 6 (3%) chose Mexico. In the whole dataset, 4866 (77%) chose Spain, 273 (4%) chose the USA and 219 (3%) chose Argentina.
The frequency table and barplot of affiliate_code_signup is located below.
| | Total | endika | fitness_revolucionario | mammothhunters | martina_ferrer_ | cristinamanyer | mariapelazas | keto_aove | pablo_kuhnert | nicotononpt | ... | gloriaalcalar | gloria_martinez | fullmusculo | eat2winmedia | dracaminodiaz | blanca | andreajuan | anabel_freyes | MyHixel | healthybyjane |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Frequency | 13 | 4 | 3 | 3 | 1 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Percent | 100.0% | 30.77% | 23.08% | 23.08% | 7.69% | 7.69% | 7.69% | 0.0% | 0.0% | 0.0% | ... | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
2 rows × 28 columns
In this data subset, the total number of users who signed up with an affiliate code is 13. Six different affiliate codes were used. The most frequent one is, as before, endika with 4 (31%).
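The many zero-count columns in the tables above are what pandas produces when a categorical column keeps levels that no longer occur in a subset. The sketch below, on invented toy data, shows how the number of codes actually used can be counted and how unused levels can be dropped.

```python
import pandas as pd

# Toy codes; the category list mimics levels inherited from the full table.
codes = pd.Series(
    ["endika", "endika", "mammothhunters", None],
    dtype=pd.CategoricalDtype(["endika", "mammothhunters", "keto_aove", "MyHixel"]),
)

# value_counts on a categorical keeps unused categories with a count of 0.
counts = codes.value_counts()

# Number of distinct codes that actually occur in this subset.
used_codes = int((counts > 0).sum())

# Dropping unused levels gives a table without the zero columns.
trimmed = codes.cat.remove_unused_categories().value_counts()

print(counts)
print("codes in use:", used_codes)
print(trimmed)
```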
The variables taken as boolean are:
After converting the boolean variables to categorical values, the frequency tables below show their counts and percentages.
| Variable | Factor | Frequency | Percent |
|---|---|---|---|
| scientific_data_usage | False | 0.00 | 0.0% |
| | True | 5858.00 | 100.0% |
| | Total | 5858.00 | 100.0% |
| newsletter_subscription | False | 1079.00 | 18.42% |
| | True | 4779.00 | 81.58% |
| | Total | 5858.00 | 100.0% |
| notifications_setting | False | 24.00 | 0.41% |
| | True | 5834.00 | 99.59% |
| | Total | 5858.00 | 100.0% |
| training_days_setting | True | 5858.00 | 100.0% |
| | Total | 5858.00 | 100.0% |
In this subset, 4779 (82%) users signed up for newsletter_subscription and 5834 (99.6%) turned on notifications_setting.
Looking at the numeric data, there are only 1393 valid observations. Within these valid observations there are only 177 valid country observations, 609 valid current_sign_in_at and last_sign_in_at observations and 11 valid observations of affiliate_code_signup.
Below are a barplot and a summary with the data types and non-NULL counts of the observations.
<class 'pandas.core.frame.DataFrame'> Int64Index: 1393 entries, 5 to 18660. Data columns (total 27 columns):

| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | id | 1393 non-null | category |
| 1 | created_at | 1393 non-null | datetime64[ns] |
| 2 | updated_at | 1393 non-null | datetime64[ns] |
| 3 | gender | 1393 non-null | category |
| 4 | date_of_birth | 1393 non-null | datetime64[ns] |
| 5 | height | 1393 non-null | float64 |
| 6 | weight | 1393 non-null | float64 |
| 7 | activity_level | 1393 non-null | category |
| 8 | goal | 1393 non-null | category |
| 9 | body_type | 1393 non-null | category |
| 10 | body_fat | 1393 non-null | float64 |
| 11 | newsletter_subscription | 1393 non-null | bool |
| 12 | notifications_setting | 1393 non-null | bool |
| 13 | training_days_setting | 1393 non-null | bool |
| 14 | language | 1393 non-null | category |
| 15 | country | 177 non-null | category |
| 16 | points | 1393 non-null | int64 |
| 17 | scientific_data_usage | 1393 non-null | category |
| 18 | best_weekly_streak | 1393 non-null | int64 |
| 19 | affiliate_code_signup | 11 non-null | category |
| 20 | total_sessions | 1393 non-null | float64 |
| 21 | total_time | 1393 non-null | float64 |
| 22 | kcal_per_session | 1393 non-null | float64 |
| 23 | reps_per_session | 1393 non-null | float64 |
| 24 | height[m] | 1393 non-null | float64 |
| 25 | BMI | 1393 non-null | float64 |
| 26 | BMI_category | 1393 non-null | category |

dtypes: bool(3), category(10), datetime64[ns](3), float64(9), int64(2). Memory usage: 849.2 KB
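A subset with complete numeric data, as summarised above, could be obtained by dropping rows that have a NULL in any of the numeric variables. This is a sketch on an invented toy frame; the list `NUMERIC_COLS` is an assumption about which columns were required to be non-NULL.

```python
import numpy as np
import pandas as pd

# Assumed list of numeric variables that must be non-NULL.
NUMERIC_COLS = ["total_sessions", "total_time", "kcal_per_session",
                "reps_per_session", "BMI"]

# Toy data only; these values are not taken from the users table.
toy = pd.DataFrame({
    "total_sessions": [5.0, np.nan, 3.0, 8.0],
    "total_time": [1200.0, np.nan, 900.0, 2000.0],
    "kcal_per_session": [30.0, np.nan, 25.0, 40.0],
    "reps_per_session": [60.0, np.nan, 55.0, 70.0],
    "BMI": [24.0, 22.0, np.nan, 27.0],
    "country": ["ES", None, None, "AR"],
})

# Keep only rows where every numeric variable is present.
valid = toy.dropna(subset=NUMERIC_COLS)
share_valid = len(valid) / len(toy)

# Non-NULL counts of the remaining columns, e.g. country.
remaining = valid.notna().sum()
print(len(valid), share_valid)
print(remaining)
```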
The variables taken into consideration as numerical data will be the same as before:
In this subset, NULL values in the numerical data occur mainly for the variables total_sessions, total_time, kcal_per_session and reps_per_session. Those variables have only 1395 valid observations (23.81% of all observations in the subset). The summary statistics below use only the valid data of our subset; the table has 1393 rows, presumably because BMI is also NULL for a few of those users.
| | count | mean | std | min | 25% | 50% | 75% | max | var | skewness | kurtosis | NULL count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| height | 1393.00 | 170.24 | 9.27 | 142.00 | 163.00 | 170.00 | 177.00 | 202.00 | 86.01 | 0.07 | -0.34 | 0 |
| weight | 1393.00 | 71.83 | 15.86 | 40.00 | 60.00 | 70.00 | 80.00 | 277.00 | 251.39 | 2.02 | 19.77 | 0 |
| body_fat | 1393.00 | 24.58 | 8.56 | 6.60 | 20.00 | 25.00 | 30.00 | 50.00 | 73.30 | 0.53 | 0.00 | 0 |
| points | 1393.00 | 32792.22 | 133700.91 | 0.00 | 200.00 | 500.00 | 2700.00 | 2463230.00 | 17875934333.43 | 8.11 | 99.33 | 0 |
| best_weekly_streak | 1393.00 | 3.96 | 6.16 | 1.00 | 1.00 | 2.00 | 4.00 | 49.00 | 37.99 | 3.95 | 18.85 | 0 |
| total_sessions | 1393.00 | 16.26 | 30.65 | 1.00 | 2.00 | 5.00 | 15.00 | 274.00 | 939.64 | 3.59 | 16.08 | 0 |
| total_time | 1393.00 | 20104.16 | 43640.24 | 0.00 | 1165.00 | 3879.00 | 14937.00 | 336812.00 | 1904470598.41 | 3.81 | 17.04 | 0 |
| kcal_per_session | 1393.00 | 52.92 | 159.38 | 0.00 | 7.00 | 31.25 | 69.00 | 4147.00 | 25402.32 | 18.32 | 404.47 | 0 |
| reps_per_session | 1393.00 | 2004.16 | 71446.28 | 0.00 | 15.00 | 64.00 | 126.00 | 2666671.00 | 5104571479.21 | 37.32 | 1392.99 | 0 |
| BMI | 1393.00 | 24.66 | 4.46 | 13.78 | 21.87 | 23.95 | 26.67 | 87.62 | 19.91 | 2.77 | 28.95 | 0 |
There are 1393 users in this subset of the scientific_data_usage subset. Compared to the scientific_data_usage subset, median height increased from 169 cm (IQR 162 - 175) to 170 cm (IQR 163 - 177), mean height increased from 168.78 cm (SD 10.6) to 170.24 cm (SD 9.27) and maximum height decreased from 236 cm to 202 cm. Median weight stayed the same at 70 kg (IQR 60 - 80), the minimum increased from 39 kg to 40 kg and the maximum stayed the same at 277 kg. Mean weight decreased from 72.06 kg (SD 16.63) to 71.83 kg (SD 15.86). Median body fat stayed the same at 25% (IQR 20% - 30%) and mean body fat decreased from 24.93% (SD 8.62) to 24.58% (SD 8.56), while the minimum and maximum given body fat stayed the same at 6.6% and 50% respectively.

Median points increased from 0 (IQR 0 - 0) to 500 (IQR 200 - 2700), the maximum stayed the same at 2463230 and the mean increased from 8240 (SD 67324) to 32792.22 (SD 133700.91). The maximum best_weekly_streak stayed at 49 weeks, the median increased from 0 (IQR 0 - 0) to 2 (IQR 1 - 4) and the mean increased from 0.94 (SD 3.45) to 3.96 (SD 6.16). The median, mean and maximum of total_sessions are practically unchanged at 5 (IQR 2 - 15), 16.26 (SD 30.65) and 274 sessions. For total_time (possibly in minutes; the unit is not confirmed) the median moved slightly from 3901 (IQR 1165 - 14919) to 3879 (IQR 1165 - 14937) and the mean from 20087 (SD 43611) to 20104 (SD 43640), while the minimum and maximum stayed at 0 and 336812. The median and mean of the average kilocalories burned per session (computed per user) are practically unchanged at 31.25 kcal (IQR 7 - 69) and 52.92 kcal (SD 159). The median, maximum and mean of the average number of reps per session (computed per user) are practically unchanged at 64 (IQR 15 - 126), 2666671 and 2004.16 (SD 71446.28).

Median BMI stayed at 24 (IQR 22 - 27), in the normal weight group, the minimum increased from 11 to 14, the mean decreased slightly from 25.13 (SD 4.92) to 24.66 (SD 4.46) and the maximum stayed at 88 (probably a mistake made by a user, an extreme outlier).
The normality of this subset of data is checked by the same method as previously.
Figure: Histogram plots for all numeric variables without NULLs.
Figure: QQ plots for all numeric variables without NULLs.
| | height | weight | body_fat | points | best_weekly_streak | total_sessions | total_time | kcal_per_session | reps_per_session | BMI |
|---|---|---|---|---|---|---|---|---|---|---|
| W | 0.99 | 0.91 | 0.95 | 0.26 | 0.52 | 0.54 | 0.49 | 0.18 | 0.01 | 0.87 |
| pval | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| normal | False | False | False | False | False | False | False | False | False | False |
As before, the data are not normally distributed, even when the NULL observations are omitted. Skewness and kurtosis give further evidence of non-normality.
The distribution of the data is checked for every variable. Goodness of fit is measured by the residual sum of squares (RSS). For each variable the top $5$ best fits are shown on a plot together with their RSS values.
[distfit] >fit.. [distfit] >transform.. [distfit] >[chi ] [0.06 sec] [RSS: 0.00502087] [loc=95.889 scale=13.171] [distfit] >[beta ] [0.03 sec] [RSS: 0.0048946] [loc=131.393 scale=84.604] [distfit] >[t ] [0.18 sec] [RSS: 0.00502322] [loc=170.237 scale=9.270] [distfit] >[powerlognorm] [0.37 sec] [RSS: 0.00503724] [loc=-1.809 scale=174.453] [distfit] >[burr ] [0.25 sec] [RSS: 0.00534082] [loc=-2.278 scale=169.614] [distfit] >Compute confidence interval [parametric] [distfit] >plot summary..
Figure: distfit RSS summary plot (x-axis: distribution name, y-axis: RSS, lower is better). Best fit: beta.
[distfit] >fit.. [distfit] >transform.. [distfit] >[maxwell ] [0.00 sec] [RSS: 8.84655e-05] [loc=36.067 scale=22.585] [distfit] >[erlang ] [0.07 sec] [RSS: 9.22636e-05] [loc=30.375 scale=5.630] [distfit] >[pearson3] [0.05 sec] [RSS: 9.31583e-05] [loc=71.831 scale=15.368] [distfit] >[gamma ] [0.04 sec] [RSS: 9.31579e-05] [loc=31.061 scale=5.793] [distfit] >[chi2 ] [0.02 sec] [RSS: 9.31577e-05] [loc=31.061 scale=2.897] [distfit] >Compute confidence interval [parametric] [distfit] >plot summary..
Figure: distfit RSS summary plot (x-axis: distribution name, y-axis: RSS, lower is better). Best fit: maxwell.
[distfit] >fit.. [distfit] >transform.. [distfit] >[dgamma ] [0.02 sec] [RSS: 0.172946] [loc=23.122 scale=3.555] [distfit] >[dweibull ] [0.02 sec] [RSS: 0.174803] [loc=23.257 scale=7.674] [distfit] >[triang ] [0.16 sec] [RSS: 0.177868] [loc=6.436 scale=44.631] [distfit] >[genextreme] [0.10 sec] [RSS: 0.178235] [loc=20.955 scale=7.518] [distfit] >[lognorm ] [0.07 sec] [RSS: 0.178248] [loc=-12.397 scale=36.007] [distfit] >Compute confidence interval [parametric] [distfit] >plot summary..
Figure: distfit RSS summary plot (x-axis: distribution name, y-axis: RSS, lower is better). Best fit: dgamma.
[distfit] >fit.. [distfit] >transform.. [distfit] >[exponnorm ] [0.09 sec] [RSS: 2.24662e-11] [loc=-3.063 scale=17.928] [distfit] >[genexpon ] [1.35 sec] [RSS: 2.25394e-11] [loc=-0.000 scale=81245.812] [distfit] >[expon ] [0.00 sec] [RSS: 2.25394e-11] [loc=0.000 scale=32792.217] [distfit] >[halflogistic ] [0.03 sec] [RSS: 3.90605e-11] [loc=-0.000 scale=30638.370] [distfit] >[genhalflogistic] [0.09 sec] [RSS: 3.8801e-11] [loc=-0.495 scale=30579.066] [distfit] >Compute confidence interval [parametric] [distfit] >plot summary..
Figure: distfit RSS summary plot (x-axis: distribution name, y-axis: RSS, lower is better). Best fit: exponnorm.
[distfit] >fit.. [distfit] >transform.. [distfit] >[gilbrat ] [0.02 sec] [RSS: 0.00470744] [loc=0.639 scale=1.244] [distfit] >[beta ] [0.10 sec] [RSS: 0.0739864] [loc=1.000 scale=694.957] [distfit] >[pearson3] [0.13 sec] [RSS: 0.0100343] [loc=2.878 scale=2.142] [distfit] >[cauchy ] [0.00 sec] [RSS: 0.014674] [loc=1.234 scale=0.669] [distfit] >[wald ] [0.01 sec] [RSS: 0.0194036] [loc=0.312 scale=2.746] [distfit] >Compute confidence interval [parametric] [distfit] >plot summary..
Figure: distfit RSS summary plot (x-axis: distribution name, y-axis: RSS, lower is better). Best fit: gilbrat.
[distfit] >fit.. [distfit] >transform.. [distfit] >[foldcauchy] [0.04 sec] [RSS: 9.84436e-05] [loc=1.000 scale=3.184] [distfit] >[halfcauchy] [0.02 sec] [RSS: 9.90314e-05] [loc=1.000 scale=3.196] [distfit] >[cauchy ] [0.00 sec] [RSS: 0.000152909] [loc=2.988 scale=2.849] [distfit] >[alpha ] [0.01 sec] [RSS: 0.000374838] [loc=0.001 scale=1.883] [distfit] >[t ] [0.08 sec] [RSS: 0.000391715] [loc=2.467 scale=2.082] [distfit] >Compute confidence interval [parametric] [distfit] >plot summary..
Figure: distfit RSS summary plot (x-axis: distribution name, y-axis: RSS, lower is better). Best fit: foldcauchy.
[distfit] >fit.. [distfit] >transform.. [distfit] >[halfcauchy ] [0.05 sec] [RSS: 3.61999e-11] [loc=-0.000 scale=3962.202] [distfit] >[cauchy ] [0.01 sec] [RSS: 1.08565e-10] [loc=2464.453 scale=2924.470] [distfit] >[t ] [0.16 sec] [RSS: 1.38392e-10] [loc=1925.511 scale=1991.077] [distfit] >[tukeylambda] [1.63 sec] [RSS: 1.9457e-10] [loc=1884.669 scale=577.463] [distfit] >[gilbrat ] [0.02 sec] [RSS: 4.68186e-10] [loc=-1254.375 scale=7147.656] [distfit] >Compute confidence interval [parametric] [distfit] >plot summary..
Figure: distfit RSS summary plot (x-axis: distribution name, y-axis: RSS, lower is better). Best fit: halfcauchy.
[distfit] >fit.. [distfit] >transform.. [distfit] >[halflogistic ] [0.03 sec] [RSS: 3.15684e-07] [loc=-0.000 scale=39.855] [distfit] >[genhalflogistic] [0.09 sec] [RSS: 3.17229e-07] [loc=-0.000 scale=39.864] [distfit] >[gumbel_r ] [0.00 sec] [RSS: 3.90096e-07] [loc=27.268 scale=36.244] [distfit] >[genlogistic ] [0.07 sec] [RSS: 4.08793e-07] [loc=-232.292 scale=36.275] [distfit] >[dgamma ] [0.06 sec] [RSS: 1.47624e-05] [loc=60.000 scale=65.421] [distfit] >Compute confidence interval [parametric] [distfit] >plot summary..
Figure: distfit RSS summary plot (x-axis: distribution name, y-axis: RSS, lower is better). Best fit: halflogistic.
[distfit] >fit.. [distfit] >transform.. [distfit] >[truncnorm] [0.26 sec] [RSS: 1.07664e-10] [loc=-597.181 scale=71349.997] [distfit] >[foldnorm ] [0.05 sec] [RSS: 1.08756e-10] [loc=-0.000 scale=71432.140] [distfit] >[halfnorm ] [0.02 sec] [RSS: 1.09915e-10] [loc=-0.000 scale=71874.572] [distfit] >[rice ] [0.04 sec] [RSS: 1.37732e-10] [loc=-71305.258 scale=72371.189] [distfit] >[rayleigh ] [0.00 sec] [RSS: 1.37732e-10] [loc=-71305.259 scale=72371.189] [distfit] >Compute confidence interval [parametric] [distfit] >plot summary..
Figure: distfit RSS summary plot (x-axis: distribution name, y-axis: RSS, lower is better). Best fit: truncnorm.
distfit output (fit and transform, parametric confidence interval computed):
| Distribution | Time (sec) | RSS | loc | scale |
|---|---|---|---|---|
| fisk | 0.14 | 0.000277334 | 11.018 | 13.020 |
| exponnorm | 0.02 | 0.000311229 | 21.081 | 2.296 |
| burr | 0.14 | 0.000319281 | -0.107 | 21.995 |
| mielke | 0.10 | 0.000319468 | -0.190 | 22.069 |
| johnsonsu | 0.26 | 0.000383062 | 19.490 | 5.247 |
[Figure: distfit summary plot, RSS (lower is better) by distribution name. Best fit: fisk.]
The most frequent best-fitting distribution is dgamma, followed by chi, Maxwell, exponentially modified Gaussian (exponnorm), Gilbrat, Pearson, folded Cauchy, half-logistic, Fisk and truncated normal.
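The ranking printed by distfit can be approximated with SciPy alone. The following is a minimal sketch, assuming the RSS is the sum of squared differences between the histogram density and the fitted PDF evaluated at the bin centres. The helper name `rank_distributions`, the bin count, the candidate list and the synthetic sample are illustrative assumptions, not the notebook's actual code or data.

```python
import numpy as np
from scipy import stats


def rank_distributions(data, candidates, bins=50):
    """Fit each candidate distribution and rank by RSS (lower is better)."""
    # Empirical density from a histogram, evaluated at the bin centres.
    density, edges = np.histogram(data, bins=bins, density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    results = []
    for name in candidates:
        dist = getattr(stats, name)
        params = dist.fit(data)            # maximum-likelihood estimates
        pdf = dist.pdf(centers, *params)   # fitted density at the bin centres
        rss = float(np.sum((density - pdf) ** 2))
        results.append((name, rss, params))
    return sorted(results, key=lambda item: item[1])


# Synthetic half-normal sample (an assumption, not the users data).
rng = np.random.default_rng(0)
sample = np.abs(rng.normal(0, 40, 2000))

ranking = rank_distributions(sample, ["halfnorm", "expon", "norm"])
for name, rss, params in ranking:
    print(f"{name:10s} RSS={rss:.3e}")
```

The first entry of `ranking` plays the role of the "Best fit" title in the summary plots above.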
This section will be done later.
The variables taken as categorical are:
The data can be examined through the frequency tables with percentages shown below.
| Variable | Factor | Frequency | Percent | Cumulative Percent |
|---|---|---|---|---|
| Gender | female | 607 | 43.58% | 43.58% |
| | male | 786 | 56.42% | 100.0% |
| | Total | 1393 | 100.0% | - |
| Activity_level | very active | 127 | 9.12% | 9.12% |
| | active | 723 | 51.9% | 61.02% |
| | sedentary | 543 | 38.98% | 100.0% |
| | Total | 1393 | 100.0% | - |
| Goal | lose | 628 | 45.08% | 45.08% |
| | gain | 583 | 41.85% | 86.93% |
| | antiaging | 182 | 13.07% | 100.0% |
| | Total | 1393 | 100.0% | - |
| Language | en | 22 | 1.58% | 1.58% |
| | es | 1371 | 98.42% | 100.0% |
| | Total | 1393 | 100.0% | - |
| Body_type | thin | 646 | 46.37% | 46.37% |
| | mid | 582 | 41.78% | 88.16% |
| | strong | 165 | 11.84% | 100.0% |
| | Total | 1393 | 100.0% | - |
| BMI_category | Normal | 809 | 58.08% | 58.08% |
| | Obesity | 147 | 10.55% | 68.63% |
| | Overweight | 395 | 28.36% | 96.98% |
| | Underweight | 42 | 3.02% | 100.0% |
| | Total | 1393 | 100.0% | - |
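A frequency table of this shape can be produced with a small pandas helper. The sketch below is an assumption about how such a table might be built, not the notebook's own code. The function name `frequency_table` is hypothetical, and the toy series only reproduces the gender counts from the table above.

```python
import pandas as pd


def frequency_table(series: pd.Series) -> pd.DataFrame:
    """Return frequency, percent and cumulative percent per category."""
    counts = series.value_counts(dropna=True).sort_index()
    percent = counts / counts.sum() * 100
    return pd.DataFrame({
        "Frequency": counts,
        "Percent": percent.round(2),
        "Cumulative Percent": percent.cumsum().round(2),
    })


# Toy series rebuilt from the gender counts reported above.
gender = pd.Series(["female"] * 607 + ["male"] * 786, name="gender")
table = frequency_table(gender)
print(table)
```

Sorting by index keeps the row order stable, which matters because the cumulative percent depends on the order of the categories.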
In this subset of users who agreed to scientific data usage, there are 607 women (44%) and 786 men (56%). The largest activity-level group is active with 723 (52%) observations, followed by sedentary with 543 (39%), and the smallest is very active with 127 (9%). Most users in this subset have the goal to lose weight, 628 (45%), then to gain weight, 583 (42%), and the smallest group has the antiaging goal, 182 (13%). Almost all users chose Spanish as the app language, 1371 (98%); only 22 (2%) users chose English. Most users in this subset reported a thin body type, 646 (46%), then mid, 582 (42%), and strong, 165 (12%). The most frequent BMI category is normal, 809 (58%), then overweight, 395 (28%), obesity, 147 (11%), and underweight, 42 (3%).
Below is the frequency table and barplot of variable country.
| | Total | ES | AR | MX | CH | CO | CA | CL | AU | DO | ... | HR | HU | IN | IT | AE | JP | KG | LB | LT | JM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Frequency | 177 | 145 | 7 | 5 | 3 | 2 | 2 | 2 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Percent | 100.0% | 81.92% | 3.95% | 2.82% | 1.69% | 1.13% | 1.13% | 1.13% | 0.56% | 0.56% | ... | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
2 rows × 80 columns
In the non-null subset of users who agreed to scientific data usage, only 177 users gave their country. Of these, 145 (82%) chose Spain, 7 (4%) chose Argentina and 5 (3%) chose Mexico.
The frequency table and barplot of affiliate_code_signup are shown below.
| | Total | fitness_revolucionario | mammothhunters | endika | martina_ferrer_ | cristinamanyer | mariapelazas | keto_aove | pablo_kuhnert | nicotononpt | ... | gloriaalcalar | gloria_martinez | fullmusculo | eat2winmedia | dracaminodiaz | blanca | andreajuan | anabel_freyes | MyHixel | healthybyjane |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Frequency | 11 | 3 | 3 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Percent | 100.0% | 27.27% | 27.27% | 18.18% | 9.09% | 9.09% | 9.09% | 0.0% | 0.0% | 0.0% | ... | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
2 rows × 28 columns
In this data subset, 11 users signed up with an affiliate code, and 6 different affiliate codes were used. The most frequent codes are fitness_revolucionario and mammothhunters, with 3 users (27%) each.
The variables taken as boolean are:
| Variable | Factor | Frequency | Percent |
|---|---|---|---|
| scientific_data_usage | False | 0 | 0.0% |
| | True | 1393 | 100.0% |
| | Total | 1393 | 100.0% |
| newsletter_subscription | False | 292 | 20.96% |
| | True | 1101 | 79.04% |
| | Total | 1393 | 100.0% |
| notifications_setting | False | 16 | 1.15% |
| | True | 1377 | 98.85% |
| | Total | 1393 | 100.0% |
| training_days_setting | True | 1393 | 100.0% |
| | Total | 1393 | 100.0% |
In this subset, 1101 (79%) users subscribed to the newsletter (newsletter_subscription) and 1377 (99%) turned on notifications (notifications_setting).
The user_achievements table contains 31765 observations, and there can be multiple observations for each user. The data frame contains:
Below is information about the data types and non-null values.
<class 'pandas.core.frame.DataFrame'>, RangeIndex: 31765 entries, 0 to 31764. Data columns (total 5 columns):
| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | id | 31765 non-null | int64 |
| 1 | user_id | 31765 non-null | int64 |
| 2 | achievment_id | 31765 non-null | int64 |
| 3 | created_at | 31765 non-null | datetime64[ns] |
| 4 | updated_at | 31765 non-null | datetime64[ns] |
dtypes: datetime64[ns](2), int64(3). Memory usage: 1.2 MB.
In the analysis, only the last achievement of each user is taken into consideration. After this filtering, only 3625 observations remain ($11.4\%$ of the whole table) and the user_id values are unique. Below are the data types and non-null counts for this subset.
<class 'pandas.core.frame.DataFrame'>, Int64Index: 3625 entries, 4 to 31764. Data columns (total 5 columns):
| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | id | 3625 non-null | int64 |
| 1 | user_id | 3625 non-null | int64 |
| 2 | achievment_id | 3625 non-null | int64 |
| 3 | created_at | 3625 non-null | datetime64[ns] |
| 4 | updated_at | 3625 non-null | datetime64[ns] |
dtypes: datetime64[ns](2), int64(3). Memory usage: 169.9 KB.
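Keeping one row per user can be done by sorting and dropping duplicates. This is a minimal sketch, assuming that "last achievement" means the row with the most recent created_at for each user_id; the source does not state the exact criterion, and the toy rows are invented for illustration.

```python
import pandas as pd

# Toy data with the same column names as user_achievements (values are made up).
user_achievements = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "user_id": [10, 10, 11, 12, 12],
    "achievment_id": [3, 4, 3, 3, 5],
    "created_at": pd.to_datetime(
        ["2021-01-01", "2021-02-01", "2021-01-15", "2021-01-03", "2021-03-01"]
    ),
})

# Keep only the most recent row per user, then restore the original row order.
last_achievement = (
    user_achievements.sort_values("created_at")
    .drop_duplicates(subset="user_id", keep="last")
    .sort_index()
)

# Share of the original table that remains after filtering.
share = len(last_achievement) / len(user_achievements)
print(last_achievement)
print(f"{share:.1%} of rows kept")
```

Because the original index is retained, the filtered frame has a non-contiguous index, which is consistent with the "4 to 31764" index range reported above.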
Treating achievement_id as a numerical value, the summary statistics (mean, standard deviation, minimum, maximum, quartiles, variance, skewness, kurtosis and NULL count) are given below.
| | count | mean | std | min | 25% | 50% | 75% | max | var | skewness | kurtosis | NULL count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| achievment_id | 3625.00 | 10.74 | 8.95 | 3.00 | 3.00 | 6.00 | 17.00 | 34.00 | 80.14 | 0.85 | -0.66 | 0 |
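A summary of this shape can be assembled by extending `DataFrame.describe()`. The sketch below is one possible way to do it, not necessarily how the notebook computed the table. The helper name `extended_describe` and the toy values are assumptions.

```python
import pandas as pd


def extended_describe(df: pd.DataFrame) -> pd.DataFrame:
    """describe() plus variance, skewness, kurtosis and NULL count per column."""
    summary = df.describe().T
    summary["var"] = df.var()
    summary["skewness"] = df.skew()
    summary["kurtosis"] = df.kurt()   # excess kurtosis (0 for a normal distribution)
    summary["NULL count"] = df.isna().sum()
    return summary


# Toy column (made-up values) named like the real one.
toy = pd.DataFrame({"achievment_id": [3, 3, 4, 5, 17, 34]})
stats_table = extended_describe(toy)
print(stats_table.round(2))
```

Note that pandas reports excess kurtosis, so a negative value such as the one in the table above indicates tails lighter than those of a normal distribution.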
The mean achievement is 11 (Gorilla), the minimum is 3 (Caterpillar, the starting achievement), the maximum is 34 (Just a regular folk) and the median is 6 (Chipmunk). There are no NULL values.
Alternatively, achievement_id can be treated as a categorical variable. In that case, the most common achievement is 3 (Caterpillar) with 1154 (32%) users, followed by 4 (Snail) with 414 (11%) users and 5 (Turtle) with 235 (6%) users. The least common achievement is 34 (Just a regular folk) with 2 (0.06%) users. Below is a frequency table sorted in descending order and a barplot with the number of occurrences of each achievement_id.
| achievment_id | Frequency | Percent |
|---|---|---|
| Total | 3625 | 100.0% |
| 3 | 1154 | 31.83% |
| 4 | 414 | 11.42% |
| 5 | 235 | 6.48% |
| 24 | 156 | 4.3% |
| 23 | 128 | 3.53% |
| 13 | 119 | 3.28% |
| 6 | 117 | 3.23% |
| 17 | 103 | 2.84% |
| 14 | 102 | 2.81% |
| 22 | 88 | 2.43% |
| 25 | 78 | 2.15% |
| 9 | 76 | 2.1% |
| 7 | 76 | 2.1% |
| 16 | 70 | 1.93% |
| 21 | 68 | 1.88% |
| 15 | 67 | 1.85% |
| 10 | 57 | 1.57% |
| 8 | 57 | 1.57% |
| 26 | 51 | 1.41% |
| 19 | 51 | 1.41% |
| 11 | 51 | 1.41% |
| 29 | 46 | 1.27% |
| 33 | 46 | 1.27% |
| 20 | 45 | 1.24% |
| 32 | 40 | 1.1% |
| 12 | 34 | 0.94% |
| 28 | 34 | 0.94% |
| 27 | 29 | 0.8% |
| 18 | 18 | 0.5% |
| 31 | 13 | 0.36% |
| 34 | 2 | 0.06% |
The user_achievements table contains user_id, so it can be merged with the users table, keeping every id from the users table together with the matching user_id values from user_achievements.
Only 5 columns are chosen from the merged table:
Below is a glimpse of this table.
| | id_x | user_id | achievment_id | id_y | points |
|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | 1880 | 25884 |
| 1 | 247.00 | 747.00 | 3.00 | 747 | 100 |
| 2 | NaN | NaN | NaN | 3469 | 580 |
| 3 | NaN | NaN | NaN | 1876 | 0 |
| 4 | NaN | NaN | NaN | 1886 | 11014 |
| 5 | 29073.00 | 1264.00 | 4.00 | 1264 | 650 |
| 6 | NaN | NaN | NaN | 1875 | 0 |
| 7 | NaN | NaN | NaN | 1877 | 0 |
| 8 | 8306.00 | 8228.00 | 3.00 | 8228 | 350 |
| 9 | NaN | NaN | NaN | 1874 | 65338 |
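The NaN pattern in the glimpse is what a merge that keeps every user produces. The following is a hedged sketch of such a merge, assuming a right join from the filtered achievements onto users. The toy values are taken from rows 0, 1 and 5 of the glimpse, and the frame names are assumptions.

```python
import pandas as pd

# Toy frames; values copied from three rows of the glimpse above.
users = pd.DataFrame({"id": [747, 1264, 1880], "points": [100, 650, 25884]})
last_achievement = pd.DataFrame({
    "id": [247, 29073],
    "user_id": [747, 1264],
    "achievment_id": [3, 4],
})

# Keep every user; users without an achievement get NaN in the achievement columns.
merged = last_achievement.merge(
    users, how="right", left_on="user_id", right_on="id"
)
merged = merged[["id_x", "user_id", "achievment_id", "id_y", "points"]]
print(merged)
```

Both frames have an `id` column, so pandas adds the `_x` and `_y` suffixes: `id_x` is the achievement row id and `id_y` is the user id. The integer columns from the achievements side become floats because NaN cannot be stored in an int64 column.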
The table shows that some users who have points do not have an achievement. The counts for all four possible situations are given below.
| Situation | Number of users |
|---|---|
| Have 0 points and achievement assigned | 0 |
| Have 0 points and no achievement assigned | 9183 |
| Have points and no achievement assigned | 5880 |
| Have points and achievement assigned | 3625 |
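The four counts can be derived from two boolean masks on the merged table. This is a minimal sketch on made-up rows, assuming that "has points" means points greater than zero and "has achievement" means a non-null achievment_id.

```python
import pandas as pd

# Toy merged table (made-up rows; NaN marks a user without an achievement).
merged = pd.DataFrame({
    "achievment_id": [None, 3.0, None, None, 4.0],
    "points": [25884, 100, 580, 0, 650],
})

has_points = merged["points"] > 0
has_achievement = merged["achievment_id"].notna()

situations = {
    "0 points, achievement": int((~has_points & has_achievement).sum()),
    "0 points, no achievement": int((~has_points & ~has_achievement).sum()),
    "points, no achievement": int((has_points & ~has_achievement).sum()),
    "points, achievement": int((has_points & has_achievement).sum()),
}
for label, count in situations.items():
    print(f"{label}: {count}")
```

The four masks are mutually exclusive and cover every row, so the counts always sum to the number of rows in the merged table.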
The number of users across these four situations sums to 18688, which equals the number of observations in the users table. This means that if the analysis used only the last situation (points and an achievement assigned), only 3625 observations would be available. Every user should have some achievement assigned at the start, so it would be best to assign an achievement to every user according to their points, using the achievements table. This would give the largest number of observations to analyze.
The user_programs table contains 81321 observations, and there can be multiple observations for each user. The data frame contains:
Below is information about the data types and non-null values. User_id and program_id are treated as categories. A table counting the program completions is also given below.
<class 'pandas.core.frame.DataFrame'>, RangeIndex: 81321 entries, 0 to 81320. Data columns (total 10 columns):
| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | id | 81321 non-null | int64 |
| 1 | user_id | 81321 non-null | category |
| 2 | program_id | 81321 non-null | category |
| 3 | created_at | 81321 non-null | datetime64[ns] |
| 4 | updated_at | 81321 non-null | datetime64[ns] |
| 5 | active | 81321 non-null | bool |
| 6 | current_session_id | 81305 non-null | float64 |
| 7 | completed | 81321 non-null | bool |
| 8 | enjoyment | 1599 non-null | float64 |
| 9 | enjoyment_notes | 164 non-null | object |
dtypes: bool(2), category(2), datetime64[ns](2), float64(2), int64(1), object(1). Memory usage: 4.8+ MB.
| completed | Frequency | Percent |
|---|---|---|
| False | 72227 | 88.82% |
| True | 9094 | 11.18% |
| Total | 81321 | 100.0% |
The tables above show that users completed programs only 9094 times (11% of all started programs). Out of these 9094 completions, enjoyment feedback was given only 1599 times (18%) and written feedback only 164 times (2%).
Below is a table of currently active programs.
| active | Frequency | Percent |
|---|---|---|
| False | 66181 | 81.38% |
| True | 15140 | 18.62% |
| Total | 81321 | 100.0% |
There are 15140 (19%) currently active programs. The table and barplot below show the ten users who started the most programs. The top five are the users with id 360 (396 programs started), 708 (54), 989 (41), 1561 (38) and 7094 (38).
| user_id | 360 | 708 | 989 | 1561 | 7094 | 875 | 8055 | 6271 | 7948 | 5093 |
|---|---|---|---|---|---|---|---|---|---|---|
| program_id | 396 | 54 | 41 | 38 | 38 | 37 | 37 | 36 | 36 | 36 |
It is also possible to take a subset of only the completed programs. In that subset, the user with id 3169 completed the most programs (18), followed by the user with id 2390 (16).
| user_id | 3169 | 2390 | 2677 | 1718 | 2526 | 1799 | 7761 | 13552 | 1855 | 2013 | 1860 | 1285 | 3111 | 3216 | 1857 | 1350 | 2648 | 8165 | 1333 | 3214 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| program_id | 18 | 16 | 13 | 11 | 10 | 10 | 9 | 9 | 8 | 8 | 8 | 8 | 8 | 8 | 7 | 7 | 7 | 7 | 7 | 7 |
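Per-user counts like the two tables above can be obtained by grouping on user_id and counting program_id, which would also explain why the count row is labelled program_id. The sketch below uses made-up rows and is an assumption about the approach, not the notebook's code.

```python
import pandas as pd

# Toy user_programs rows (made up); one row per started program.
user_programs = pd.DataFrame({
    "user_id": [360, 360, 360, 708, 708, 989],
    "program_id": [5, 36, 29, 5, 504, 5],
    "completed": [False, True, False, False, True, False],
})

# Programs started per user, largest first (top ten).
started_per_user = (
    user_programs.groupby("user_id")["program_id"]
    .count()
    .sort_values(ascending=False)
    .head(10)
)

# Same count restricted to completed programs.
completed_per_user = (
    user_programs[user_programs["completed"]]
    .groupby("user_id")["program_id"]
    .count()
    .sort_values(ascending=False)
)
print(started_per_user)
print(completed_per_user)
```

Swapping the roles of the two columns (grouping on program_id and counting user_id) gives the per-program counts in the same way.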
The table and barplot below show the most frequently started programs. The programs that users started most often have ids 5, 36, 29, 504, 10, 30, 34, 6, 12, 38 and 7.
| program_id | 5 | 36 | 29 | 504 | 10 | 30 | 34 | 6 | 12 | 38 | 7 | 428 | 23 | 39 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| user_id | 20984 | 20600 | 9131 | 7286 | 2207 | 1912 | 1600 | 1571 | 913 | 876 | 827 | 817 | 801 | 794 | 789 |
The table and barplot below show the 20 most frequently completed programs. The most frequently completed program is program 504 with 7286 completions. The second most frequently completed program is program 6 with 197 completions.
| program_id | 504 | 6 | 10 | 29 | 7 | 12 | 428 | 23 | 13 | 16 | 14 | 8 | 34 | 30 | 25 | 503 | 22 | 9 | 500 | 26 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| user_id | 7286 | 197 | 194 | 130 | 117 | 109 | 98 | 94 | 84 | 68 | 66 | 62 | 59 | 44 | 42 | 41 | 39 | 39 | 36 | 35 |
It is possible to merge the users table (id) with the user_programs table (user_id) to get characteristics for specific groups and programs. This makes it possible to compare gender, activity level, goal, body type, notification settings, language, BMI category and the number of programs completed.
First, the merged table is filtered to completed programs. The count of completed programs and the number of points for each user are presented below.
| user_id | count | points | BMI | notification_settings | scientific_data_usage |
|---|---|---|---|---|---|
| 3169 | 18 | 370089 | 24.24 | True | False |
| 2390 | 16 | 838205 | 25.77 | True | False |
| 2677 | 13 | 583894 | 30.67 | True | False |
| 1718 | 11 | 1810142 | 24.52 | True | False |
| 2526 | 10 | 308822 | 19.23 | True | False |
The largest number of completed programs is 18, for user 3169, whose BMI (24.24) is in the normal range. Curiously, this user does not have the largest number of points. The largest number of points is 2749450, for user 1442, who is overweight (BMI = 25.99) and completed only 2 programs (the head of this table is shown below). Both the user with the most completed programs and the user with the most points had notifications turned on, but neither of them agreed to scientific data usage.
| user_id | count | points | BMI | notification_settings | scientific_data_usage |
|---|---|---|---|---|---|
| 1442 | 2 | 2749450 | 25.99 | True | False |
| 2978 | 1 | 2741622 | 24.72 | True | False |
| 3061 | 3 | 2482727 | 21.91 | True | False |
| 889 | 2 | 2463230 | 28.40 | True | True |
| 3007 | 1 | 2305978 | 17.36 | True | False |
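The two views above (sorted by count and sorted by points) can come from one per-user summary. This is a sketch under assumptions: the ids, points and BMI values are copied from the tables above, but the program rows and resulting counts are made up, and the frame names are hypothetical.

```python
import pandas as pd

# User attributes copied from the tables above; program rows are made up.
users = pd.DataFrame({
    "id": [3169, 1442, 889],
    "points": [370089, 2749450, 2463230],
    "BMI": [24.24, 25.99, 28.40],
})
user_programs = pd.DataFrame({
    "user_id": [3169, 3169, 3169, 1442, 1442, 889],
    "completed": [True, True, True, True, False, True],
})

# Count completed programs per user, then attach the user attributes.
completed = user_programs[user_programs["completed"]]
summary = (
    completed.groupby("user_id").size().rename("count").to_frame()
    .join(users.set_index("id")[["points", "BMI"]])
)

by_count = summary.sort_values("count", ascending=False)
by_points = summary.sort_values("points", ascending=False)
print(by_count.head())
print(by_points.head())
```

Sorting the same summary two ways makes it easy to see that the user with the most completed programs is not necessarily the user with the most points.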
Below there are frequency tables for completed programs.
| Variable | Factor | Frequency | Percent | Cumulative Percent |
|---|---|---|---|---|
| Gender | female | 2539 | 27.92% | 27.92% |
| | male | 6555 | 72.08% | 100.0% |
| | Total | 9094 | 100.0% | - |
| Activity_level | very active | 1317 | 14.48% | 14.48% |
| | active | 5239 | 57.61% | 72.09% |
| | sedentary | 2538 | 27.91% | 100.0% |
| | Total | 9094 | 100.0% | - |
| Goal | lose | 3217 | 35.37% | 35.37% |
| | gain | 4398 | 48.36% | 83.74% |
| | antiaging | 1479 | 16.26% | 100.0% |
| | Total | 9094 | 100.0% | - |
| Language | en | 132 | 1.45% | 1.45% |
| | es | 8962 | 98.55% | 100.0% |
| | Total | 9094 | 100.0% | - |
| Body_type | thin | 3286 | 36.13% | 36.13% |
| | mid | 4930 | 54.21% | 90.35% |
| | strong | 878 | 9.65% | 100.0% |
| | Total | 9094 | 100.0% | - |
| BMI_category | Normal | 5656 | 62.33% | 62.33% |
| | Obesity | 658 | 7.25% | 69.58% |
| | Overweight | 2602 | 28.67% | 98.25% |
| | Underweight | 159 | 1.75% | 100.0% |
| | Total | 9075 | 100.0% | - |
In this subset, most of the completed programs were completed by men, 6555 (72%), while women completed 2539 (28%). Users with the active activity level completed 5239 programs (58%), sedentary users completed 2538 (28%) and very active users completed 1317 (14%). Users with the goal of gaining weight completed 4398 programs (48%), those with the goal of losing weight completed 3217 (35%) and those with the antiaging goal completed 1479 (16%). Users with the app set to English completed 132 programs (1%) and users with the app set to Spanish completed 8962 (99%). Users with the mid body type completed 4930 programs (54%), those with the thin body type completed 3286 (36%) and those with the strong body type completed 878 (10%). By BMI category, most programs were completed by users with normal weight, 5656 (62%), followed by overweight users with 2602 (29%), users with obesity with 658 (7%) and underweight users with 159 (2%).
Below are the frequency tables for the boolean variables.
| Variable | Factor | Frequency | Percent |
|---|---|---|---|
| scientific_data_usage | False | 8095 | 89.01% |
| | True | 999 | 10.99% |
| | Total | 9094 | 100.0% |
| newsletter_subscription | False | 1896 | 20.85% |
| | True | 7198 | 79.15% |
| | Total | 9094 | 100.0% |
| notifications_setting | False | 113 | 1.24% |
| | True | 8981 | 98.76% |
| | Total | 9094 | 100.0% |
Users who agreed to scientific data usage account for only 999 (11%) of the completed programs. Users subscribed to the newsletter completed 7198 (79%) programs, and users with notifications turned on completed 8981 (99%) programs.
Below is the subset of data for users who agreed to scientific data usage. There are 20428 observations, and only 605 of them have an enjoyment value.
<class 'pandas.core.frame.DataFrame'>, Int64Index: 20428 entries, 0 to 81320. Data columns (total 22 columns):
| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | user_id | 20428 non-null | int64 |
| 1 | program_id | 20428 non-null | category |
| 2 | active | 20428 non-null | category |
| 3 | completed | 20428 non-null | category |
| 4 | enjoyment | 605 non-null | float64 |
| 5 | id | 20428 non-null | category |
| 6 | gender | 20428 non-null | category |
| 7 | height | 20428 non-null | float64 |
| 8 | weight | 20428 non-null | float64 |
| 9 | activity_level | 20428 non-null | category |
| 10 | goal | 20428 non-null | category |
| 11 | body_type | 20428 non-null | category |
| 12 | body_fat | 20428 non-null | float64 |
| 13 | newsletter_subscription | 20428 non-null | bool |
| 14 | notifications_setting | 20428 non-null | bool |
| 15 | affiliate_code_signup | 95 non-null | category |
| 16 | language | 20428 non-null | category |
| 17 | country | 2531 non-null | category |
| 18 | points | 20428 non-null | int64 |
| 19 | scientific_data_usage | 20428 non-null | category |
| 20 | BMI | 20396 non-null | float64 |
| 21 | BMI_category | 20396 non-null | category |
dtypes: bool(2), category(13), float64(5), int64(2). Memory usage: 2.2 MB.
The tables below show the merged and summarized user_programs and users tables. The first table is sorted by the number of programs started by each user and the second by the number of points achieved.
| user_id | count | points | BMI | notification_settings | scientific_data_usage |
|---|---|---|---|---|---|
| 989 | 41 | 526242 | 23.12 | True | True |
| 1561 | 38 | 568855 | 23.18 | True | True |
| 8055 | 37 | 42382 | 21.86 | True | True |
| 7948 | 36 | 4869 | 26.79 | True | True |
| 1102 | 33 | 8500 | 23.44 | True | True |
| user_id | count | points | BMI | notification_settings | scientific_data_usage |
|---|---|---|---|---|---|
| 889 | 24 | 2463230 | 28.40 | True | True |
| 698 | 23 | 1149712 | 25.65 | True | True |
| 1416 | 20 | 1104077 | 25.43 | True | True |
| 2236 | 21 | 1044941 | 26.37 | True | True |
| 1331 | 13 | 1039974 | 23.30 | True | True |
The user with the most started programs has id 989 (count = 41), followed by id 1561 (count = 38), id 8055 (count = 37), id 7948 (count = 36) and id 1102 (count = 33). The second table, however, shows that the user with the most points (id 889, 2463230 points) started 24 programs and, with a BMI of 28.40, is in the overweight category.
| Variable | Factor | Frequency | Percent | Cumulative Percent |
|---|---|---|---|---|
| Gender | female | 9206 | 45.07% | 45.07% |
| | male | 11222 | 54.93% | 100.0% |
| | Total | 20428 | 100.0% | - |
| Activity_level | very active | 2020 | 9.89% | 9.89% |
| | active | 10557 | 51.68% | 61.57% |
| | sedentary | 7851 | 38.43% | 100.0% |
| | Total | 20428 | 100.0% | - |
| Goal | lose | 9560 | 46.8% | 46.8% |
| | gain | 8337 | 40.81% | 87.61% |
| | antiaging | 2531 | 12.39% | 100.0% |
| | Total | 20428 | 100.0% | - |
| Language | en | 260 | 1.27% | 1.27% |
| | es | 20168 | 98.73% | 100.0% |
| | Total | 20428 | 100.0% | - |
| Body_type | thin | 9494 | 46.48% | 46.48% |
| | mid | 8285 | 40.56% | 87.03% |
| | strong | 2649 | 12.97% | 100.0% |
| | Total | 20428 | 100.0% | - |
| BMI_category | Normal | 11412 | 55.95% | 55.95% |
| | Obesity | 2525 | 12.38% | 68.33% |
| | Overweight | 5799 | 28.43% | 96.76% |
| | Underweight | 660 | 3.24% | 100.0% |
| | Total | 20396 | 100.0% | - |
In the subset of users who agreed to scientific data usage, most programs were started by men, 11222 (55%), while women started 9206 (45%). Users with the active activity level started 10557 programs (52%), sedentary users started 7851 (38%) and very active users started 2020 (10%). Users with the goal of losing weight started 9560 programs (47%), those with the goal of gaining weight started 8337 (41%) and those with the antiaging goal started 2531 (12%). Users with the app set to English started 260 programs (1%) and users with the app set to Spanish started 20168 (99%). Users with the thin body type started 9494 programs (46%), those with the mid body type started 8285 (41%) and those with the strong body type started 2649 (13%). By BMI category, most programs were started by users with normal weight, 11412 (56%), followed by overweight users with 5799 (28%), users with obesity with 2525 (12%) and underweight users with 660 (3%).
Below is the subset of data for users who agreed to scientific data usage and completed programs. There are 999 observations, and 567 of them have an enjoyment value.
<class 'pandas.core.frame.DataFrame'>, Int64Index: 999 entries, 14 to 81070. Data columns (total 22 columns):
| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | user_id | 999 non-null | int64 |
| 1 | program_id | 999 non-null | category |
| 2 | active | 999 non-null | category |
| 3 | completed | 999 non-null | category |
| 4 | enjoyment | 567 non-null | float64 |
| 5 | id | 999 non-null | category |
| 6 | gender | 999 non-null | category |
| 7 | height | 999 non-null | float64 |
| 8 | weight | 999 non-null | float64 |
| 9 | activity_level | 999 non-null | category |
| 10 | goal | 999 non-null | category |
| 11 | body_type | 999 non-null | category |
| 12 | body_fat | 999 non-null | float64 |
| 13 | newsletter_subscription | 999 non-null | bool |
| 14 | notifications_setting | 999 non-null | bool |
| 15 | affiliate_code_signup | 7 non-null | category |
| 16 | language | 999 non-null | category |
| 17 | country | 352 non-null | category |
| 18 | points | 999 non-null | int64 |
| 19 | scientific_data_usage | 999 non-null | category |
| 20 | BMI | 997 non-null | float64 |
| 21 | BMI_category | 997 non-null | category |
dtypes: bool(2), category(13), float64(5), int64(2). Memory usage: 756.3 KB.
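The two nested subsets (agreed to scientific data usage, then also completed) are plain boolean filters on the merged table. The sketch below uses made-up rows and assumes boolean columns; in the notebook these columns are stored as categories, so the actual comparison may differ.

```python
import pandas as pd

# Toy merged rows (made up); column names follow the info output above.
merged = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "completed": [True, False, True, True],
    "scientific_data_usage": [True, True, False, True],
    "enjoyment": [4.0, None, 5.0, None],
})

# Users who agreed to scientific data usage.
consented = merged[merged["scientific_data_usage"]]

# Of those, only the completed programs.
consented_completed = consented[consented["completed"]]

# Number of completed programs that carry an enjoyment rating.
rated = int(consented_completed["enjoyment"].notna().sum())
print(len(consented), len(consented_completed), rated)
```

Counting non-null enjoyment values in each subset corresponds to the enjoyment "Non-Null Count" reported in the info outputs.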
The tables below show the merged and summarized user_programs and users tables. The first table is sorted by the number of programs completed by each user and the second by the number of points achieved.
| user_id | count | points | BMI | notification_settings | scientific_data_usage |
|---|---|---|---|---|---|
| 13552 | 9 | 7400 | 29.34 | True | True |
| 1860 | 8 | 152111 | 28.39 | True | True |
| 3216 | 8 | 573934 | 29.40 | True | True |
| 2013 | 8 | 590448 | 24.16 | True | True |
| 1855 | 8 | 592526 | 22.88 | True | True |
| user_id | count | points | BMI | notification_settings | scientific_data_usage |
|---|---|---|---|---|---|
| 889 | 2 | 2463230 | 28.40 | True | True |
| 698 | 5 | 1149712 | 25.65 | True | True |
| 1416 | 6 | 1104077 | 25.43 | True | True |
| 2236 | 4 | 1044941 | 26.37 | True | True |
| 1331 | 4 | 1039974 | 23.30 | True | True |
The user with the most completed programs has id 13552 (count = 9), followed by id 1860 (count = 8), id 3216 (count = 8), id 2013 (count = 8) and id 1855 (count = 8). The second table shows that the user with the most points (id 889, 2463230 points) completed only 2 programs and, with a BMI of 28.40, is in the overweight category. This user started 24 programs and finished only 2.
| Variable | Factor | Frequency | Percent | Cumulative Percent |
|---|---|---|---|---|
| Gender | female | 339 | 33.93% | 33.93% |
| | male | 660 | 66.07% | 100.0% |
| | Total | 999 | 100.0% | - |
| Activity_level | very active | 108 | 10.81% | 10.81% |
| | active | 557 | 55.76% | 66.57% |
| | sedentary | 334 | 33.43% | 100.0% |
| | Total | 999 | 100.0% | - |
| Goal | lose | 386 | 38.64% | 38.64% |
| | gain | 470 | 47.05% | 85.69% |
| | antiaging | 143 | 14.31% | 100.0% |
| | Total | 999 | 100.0% | - |
| Language | en | 23 | 2.3% | 2.3% |
| | es | 976 | 97.7% | 100.0% |
| | Total | 999 | 100.0% | - |
| Body_type | thin | 458 | 45.85% | 45.85% |
| | mid | 445 | 44.54% | 90.39% |
| | strong | 96 | 9.61% | 100.0% |
| | Total | 999 | 100.0% | - |
| BMI_category | Normal | 630 | 63.19% | 63.19% |
| | Obesity | 61 | 6.12% | 69.31% |
| | Overweight | 286 | 28.69% | 97.99% |
| | Underweight | 20 | 2.01% | 100.0% |
| | Total | 997 | 100.0% | - |
In this subset, most of the completed programs were completed by men, 660 (66%), while women completed 339 (34%). Users with the active activity level completed 557 programs (56%), sedentary users completed 334 (33%) and very active users completed 108 (11%). Users with the goal of gaining weight completed 470 programs (47%), those with the goal of losing weight completed 386 (39%) and those with the antiaging goal completed 143 (14%). Users with the app set to English completed 23 programs (2%) and users with the app set to Spanish completed 976 (98%). Users with the thin body type completed 458 programs (46%), those with the mid body type completed 445 (45%) and those with the strong body type completed 96 (10%). By BMI category, most programs were completed by users with normal weight, 630 (63%), followed by overweight users with 286 (29%), users with obesity with 61 (6%) and underweight users with 20 (2%).